VMware Cloud Community
Wolfman017
Contributor

"All hosts contribution stats" warning

Hi,

I have a stretched cluster of two ESXi 6.5 U2 hosts (SERVER1 and SERVER2) plus one witness (WITNESS1).

When I check the vSAN health screen everything is OK except one check, which is in a Warning state: "All hosts contributing stats". The warning says that host SERVER2 is not contributing stats.

I followed this tutorial to determine which server is the stats master: vSAN hosts not contributing stats reports - vSAN Health check fails - vhabit

My stats master is SERVER1.

If I check the vsanmgmt.log file, I see these warnings:

2019-01-09T14:21:49Z VSANMGMTSVC: WARNING vsanperfsvc[Collector-1] [statscollector::RetrieveRemoteStats] Error happened during RetrieveRemoteStats of host IP_of_SERVER2, type: <class 'socket.timeout'>, message: timed out

2019-01-09T14:21:50Z VSANMGMTSVC: WARNING vsanperfsvc[Collector-1] [statscollector::RetrieveRemoteStats] Error happened during RetrieveRemoteStats of host IP_of_WITNESS1, type: <class 'socket.timeout'>, message: timed out

Communication between the servers works fine; I checked it with vmkping. The rest of vSAN works perfectly.

Any ideas?

11 Replies
TheBobkin
Champion

Hello Wolfman017,

The first step would be to restart vsanmgmtd and confirm that it is functional:

# ps | grep vsan

# /etc/init.d/vsanmgmtd stop

Check that the processes are gone (in 6.7 there will be other 'vsan'-named processes, but don't mind those):

# ps | grep vsan

# /etc/init.d/vsanmgmtd start

When you are testing ping, are you using vmkping and specifying which interface to go over (-I vmkX)?
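For example, assuming the vSAN VMkernel port is vmk1 (the actual vmk number may differ; esxcli vsan network list will show it), something along the lines of:

# vmkping -I vmk1 <vSAN IP of SERVER2>

# vmkping -I vmk1 <vSAN IP of WITNESS1>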

Bob

Wolfman017
Contributor

Hi,

I have already stopped/restarted the service after enabling debug mode. I even rebooted both nodes of the cluster, and it keeps behaving the same way.

When doing the vmkping, I specify the vSAN VMkernel port. I did not mention it earlier, but the IP addresses I refer to (when testing vmkping and in the log) are the vSAN IPs, not the management IPs.

Darking
Enthusiast

Aloha!

I would try looking at the log with the debug settings enabled, in case you haven't.

Here are the settings you need to make on the MASTER node:

VMware Knowledge Base

I have a very similar case open with support at the moment for our 8-node stretched cluster, and my debug events look like this:

2019-01-10T12:13:41Z VSANMGMTSVC: DEBUG vsanperfsvc[Collector-Main] [statscollector::_FetchAndCalculateStats] waiting for stats readiness

2019-01-10T12:13:44Z VSANMGMTSVC: WARNING vsanperfsvc[Collector-2] [statscollector::RetrieveRemoteStats] Error happened during RetrieveRemoteStats of host 172.29.242.13, type: <class 'OSError'>, message: [Errno 113] No route to host

For some very odd reason it is trying to use the interface that carries dedicated witness traffic, which only allows traffic to the witness site, not inter-site.

I would assume it would use either the VMware management network or the vSAN network, but no: mine is trying the witness traffic network.
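If you want to confirm which vmknic is tagged for which traffic type, esxcli should show a 'Traffic Type' of vsan or witness per interface:

# esxcli vsan network list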

When I hear back from support I'll post it here.

Wolfman017
Contributor

The servers communicate on the vSAN network. What do you mean by "witness network"? The "witnessPg" port group?

[screenshot attached: pastedImage_0.png]

Here is the log with debug activated:

2019-01-10T13:54:17Z VSANMGMTSVC: WARNING vsanperfsvc[Collector-1] [statscollector::RetrieveRemoteStats] Error happened during RetrieveRemoteStats of host VSAN_IP_of_SERVER2, type: <class 'socket.timeout'>, message: timed out

2019-01-10T13:54:17Z VSANMGMTSVC: DEBUG vsanperfsvc[Collector-1] [statscollector::RetrieveRemoteStats] Traceback (most recent call last):
  File "/build/mts/release/bora-10642691/bora/build/vsan/release/vsanhealth/usr/lib/vmware/vsan/perfsvc/statscollector.py", line 676, in RetrieveRemoteStats
  File "/build/mts/release/bora-10719125/bora/build/esx/release/vmvisor/sys/lib64/python3.5/site-packages/pyVmomi/VmomiSupport.py", line 557, in <lambda>
  File "/build/mts/release/bora-10719125/bora/build/esx/release/vmvisor/sys/lib64/python3.5/site-packages/pyVmomi/VmomiSupport.py", line 363, in _InvokeMethod
  File "/build/mts/release/bora-10719125/bora/build/esx/release/vmvisor/sys/lib64/python3.5/site-packages/pyVmomi/SoapAdapter.py", line 1303, in InvokeMethod
  File "/build/mts/release/bora-10719125/bora/build/esx/release/vmvisor/sys/lib64/python3.5/site-packages/pyVmomi/SoapAdapter.py", line 1369, in GetConne

2019-01-10T13:54:17Z VSANMGMTSVC: WARNING vsanperfsvc[Collector-2] [statscollector::RetrieveRemoteStats] Error happened during RetrieveRemoteStats of host VSAN_IP_of_WITNESS, type: <class 'socket.timeout'>, message: timed out

2019-01-10T13:54:17Z VSANMGMTSVC: DEBUG vsanperfsvc[Collector-2] [statscollector::RetrieveRemoteStats] Traceback (most recent call last):
  File "/build/mts/release/bora-10642691/bora/build/vsan/release/vsanhealth/usr/lib/vmware/vsan/perfsvc/statscollector.py", line 676, in RetrieveRemoteStats
  File "/build/mts/release/bora-10719125/bora/build/esx/release/vmvisor/sys/lib64/python3.5/site-packages/pyVmomi/VmomiSupport.py", line 557, in <lambda>
  File "/build/mts/release/bora-10719125/bora/build/esx/release/vmvisor/sys/lib64/python3.5/site-packages/pyVmomi/VmomiSupport.py", line 363, in _InvokeMethod
  File "/build/mts/release/bora-10719125/bora/build/esx/release/vmvisor/sys/lib64/python3.5/site-packages/pyVmomi/SoapAdapter.py", line 1303, in InvokeMethod
  File "/build/mts/release/bora-10719125/bora/build/esx/release/vmvisor/sys/lib64/python3.5/site-packages/pyVmomi/SoapAdapter.py", line 1369, in GetConne

Darking
Enthusiast

I'm referring to the witness traffic LAN.

It's a new function in 6.7 U1, which I'm running; I forgot you are not on that release. Sorry about that.

In 6.5 the vSAN network is used for both witness and vSAN traffic.

Have you tried a vmkping?

vmkping -I <interface of VSAN> destination-of-other-host

Otherwise, I would check whether a firewall has been set up that is blocking traffic (either a physical firewall or the firewall on the ESXi hosts).
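On the ESXi side, a quick sanity check would be to confirm whether the host firewall is enabled and which rulesets are active, for example:

# esxcli network firewall get

# esxcli network firewall ruleset list | grep -iE 'vsan|http'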

Darking
Enthusiast

Hi Wolfman017,

Any update on your issue?

I created a case with GSS a week ago, and unfortunately I have not yet received an analysis or resolution of the problem.

They tell me they will report back tomorrow, but the level 2 tech and his senior were stumped 😕

Wolfman017
Contributor

Hi,

No, no news. I sent the issue to our TAM, and it is not a known issue.

We have another vSAN cluster on the same vCenter, and it has the same issue.

I just built another vSAN cluster (configured exactly like the one with the issue) on another vCenter, and it does not have the issue.

Beingnsxpaddy
Enthusiast
Enthusiast

Did you try restarting the management services on the ESXi host? Also, can you add a new host to the same cluster to see the behaviour?
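For reference, a minimal way to restart the management agents from the ESXi shell (note this briefly disconnects the host from vCenter, so do it during a quiet period):

# /etc/init.d/hostd restart

# /etc/init.d/vpxa restart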

Regards, Pradhuman (VCIX-NV, VCAP-NV, vExpert, VCP2X-DCVNV). If my answer resolved your query, don't forget to mark it as "Correct Answer".
TheBobkin
Champion

Hello Wolfman017 / Darking

Do either of you by any chance have port 80 traffic blocked between the hosts and/or between hosts and vCenter?

I ask as this only got added to our required ports documentation for this service fairly recently:

vSAN Network Port Requirements 
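A quick way to test this from the ESXi shell of one host is nc against the other host's IP (the address below is a placeholder; a timeout or refusal would point at something filtering the traffic):

# nc -z <vSAN IP of SERVER2> 80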

Bob

Darking
Enthusiast

Hi TheBobkin

We are still talking with GSS and have had a PR filed.

It seems to be a network configuration issue on our end, but it took a while for GSS to figure out what was going on.

Now we see there is an article on StorageHub that describes the setup in more detail, and we are in the process of making this change:

I would swear it didn't exist when we started the setup, but who knows!

Somehow we got it into our heads that WTS was supposed to be separated out completely for each site, i.e. no routes between the primary and secondary sites. Now we see from the design schematics that we were wrong 🙂

I'll report back tomorrow, once we have reconfigured the routes for the witness site and created a single VLAN for the witness traffic between the primary and secondary sites.
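For anyone following along, witness traffic separation is tagged per vmknic and the witness routes are added per host with esxcli; a rough sketch, where vmk2 and the addresses are placeholders for our setup:

# esxcli vsan network ip add -i vmk2 -T=witness

# esxcli network ip route ipv4 add -n <witness network>/24 -g <local gateway>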

Darking
Enthusiast

Just a small update on our issue.

We seem to have it resolved now by establishing a route between the two sites for the WTS traffic.

All in all a logical solution, but we did not find the process of setting up Witness Traffic Separation very well documented when we did the initial setup. We will do extensive site failover testing tomorrow and hopefully everything will work as expected.
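For completeness, the new routes can be verified per host before re-running the health check (vmk2 and the witness IP are placeholders):

# esxcli network ip route ipv4 list

# vmkping -I vmk2 <witness vSAN IP>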
