A couple of times this week we have had the errors below on a host in a cluster:
vSAN Health Alarm 'Hosts with connectivity issues'
vSAN Health Alarm 'Stats master election'
The NICs look OK on the host and the switches, but VM ping times become irregular: around 1 ms for a few seconds, then in the hundreds of ms. To recover, we have to migrate the VMs off the host to another one, put the host into maintenance mode and reboot it; the host is then fine again for a few days.
The only job running in the background on the cluster is a vSAN rebalance, because we have added 2 more hosts; this rebalance can take 24-48 hours.
This is ESXi 6.7.0, build 10764712.
What host logs should I be looking at? I guess we may need to move to Update 3.
Welcome to Communities.
The Health alarms you mentioned are *generally* attributed to issues with vsanmgmtd on the impacted host. That being said, if the issues with vsanmgmtd were severe enough, placing the node into MM might not be possible; in your case it was, so other factors (such as the network in general) should also be looked at.
Where are you seeing the network latency spike - on the vmk used for Management or the vmk used for vSAN traffic?
Regardless, you should look at this in more depth: validate that the end-to-end MTU is correct, that the NICs are using a supported driver and firmware combination, and that there is no indication of physical issues such as CRC errors, dropped packets or general contention under load.
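As a rough illustration of the MTU check: a don't-fragment ping only succeeds if every hop carries the full packet, so the ICMP payload has to be the MTU minus 28 bytes of headers (20 for IPv4, 8 for ICMP). The sketch below computes that size and builds a vmkping command line to run on the host. The interface name vmk1 and the peer address are placeholders, not values from this thread, and the helper names are made up for the example.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def max_icmp_payload(mtu: int) -> int:
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    if mtu <= IPV4_HEADER + ICMP_HEADER:
        raise ValueError(f"MTU {mtu} is too small to carry an ICMP echo")
    return mtu - IPV4_HEADER - ICMP_HEADER


def vmkping_command(vmk: str, peer_ip: str, mtu: int) -> str:
    """Build a vmkping line: -I picks the vmk, -d sets don't-fragment, -s sets payload size."""
    return f"vmkping -I {vmk} -d -s {max_icmp_payload(mtu)} {peer_ip}"


if __name__ == "__main__":
    # Placeholder values: replace with the vSAN vmk and another host's vSAN IP.
    print(vmkping_command("vmk1", "192.0.2.10", 9000))
    print(vmkping_command("vmk1", "192.0.2.10", 1500))
```

If the full-size ping fails while a smaller one passes, some device in the path is likely configured with a smaller MTU than the hosts.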
In general you should (almost always) start by looking at vmkernel.log and vobd.log from the times when the issue is occurring. In this case I would also advise looking at vsanmgmtd.log and the output of nicinfo.sh (/usr/lib/vmware/vm-support/bin/nicinfo.sh).
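To narrow the logs down to the times when the issue occurred, something like the following could be run against copies of the log files taken off the host. It is only a sketch: it assumes each entry starts with an ISO 8601 UTC timestamp such as 2019-11-05T10:23:45.123Z, and the sample lines and the function name lines_in_window are made up for illustration.

```python
import re
from datetime import datetime, timezone

# Assumed line prefix: YYYY-MM-DDTHH:MM:SS(.fraction)Z
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?Z")


def lines_in_window(lines, start, end):
    """Yield lines whose leading timestamp falls within [start, end] (UTC)."""
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # skip lines without a leading timestamp
        stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        stamp = stamp.replace(tzinfo=timezone.utc)
        if start <= stamp <= end:
            yield line


if __name__ == "__main__":
    # Made-up sample lines; in practice iterate over open("vmkernel.log").
    sample = [
        "2019-11-05T10:00:01.000Z cpu1:1000)example entry before the window",
        "2019-11-05T10:23:45.123Z cpu2:1001)example entry inside the window",
        "2019-11-05T11:30:00.000Z cpu3:1002)example entry after the window",
    ]
    start = datetime(2019, 11, 5, 10, 15, tzinfo=timezone.utc)
    end = datetime(2019, 11, 5, 10, 45, tzinfo=timezone.utc)
    for entry in lines_in_window(sample, start, end):
        print(entry)
```

Running the same window over each log makes it easier to line up what the different services reported around a single latency spike.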
I would advise updating to 6.7 U3 regardless, as we have made a lot of improvements and fixes between the build you are on and 6.7 U3/P01.