I am seeing intermittent vSAN health check failures on ESXi 6.5. We have a 3-node cluster with L2 connectivity between the nodes over a LAG, and the vSAN vmk interfaces are intermittently unable to reach each other. vmkping between them does work when we test it. Can someone help me with this?
Specifically, which Health alert is being triggered? (This one, perhaps? kb.vmware.com/kb/2108011)
Is the alarm consistently triggered between specific pairs of nodes, or is it all nodes? (The drill-down of the alert should show this.)
Is this a stretched-cluster?
Is this intermittent, and is it causing an actual cluster partition (VMs becoming inaccessible or going down)?
If it is the alert in the KB article above, then follow the steps there to do the recommended checks.
Okay, just to clarify: when this occurs, is one host unable to reach just one other host, or is one host unable to communicate with any other host?
If it is just one host to one other host, is it always the same host-to-host connection?
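To help answer the two questions above, something like the following could be left running in the ESXi shell on each host. It is only a rough sketch: the function name check_peers, the interface vmk1 and the peer IPs are placeholders I made up, so substitute your own vSAN vmk and addresses.

```shell
#!/bin/sh
# Sketch: vmkping each peer's vSAN vmk IP through a chosen vmkernel
# interface and print one timestamped OK/FAIL line per peer.
# check_peers is a hypothetical helper, not part of ESXi.

check_peers() {
    vmk="$1"; shift
    for peer in "$@"; do
        # -I selects the outgoing vmkernel interface, -c sets the ping count
        if vmkping -I "$vmk" -c 3 "$peer" >/dev/null 2>&1; then
            status="OK"
        else
            status="FAIL"
        fi
        echo "$(date '+%Y-%m-%d %H:%M:%S') $vmk -> $peer $status"
    done
}

# Example with placeholder values (uncomment and adjust):
# while true; do check_peers vmk1 192.168.10.12 192.168.10.13; sleep 10; done
```

Comparing the FAIL lines from all three hosts should show whether the drops are always between the same pair or spread across every host.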
If it is just one host that cannot communicate with all the others, then check the NIC stats on this host using nicinfo.sh (/usr/lib/vmware/vmware-support/bin/nicinfo.sh) and/or esxcli network nic stats get -n <vmnic>, and look on the switch for any errors on the associated port if you know what you are looking at.
Either way, I would advise taking a closer look at your network configuration for vSAN on the affected host(s) to ensure best practices have been applied and nothing is misconfigured.
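As a rough sketch of the NIC stats check, the snippet below prints only the error and drop counters for the uplinks you name. The helper name nic_errors and the uplink names in the example are placeholders; the per-NIC command used is esxcli network nic stats get -n <vmnic>.

```shell
#!/bin/sh
# Sketch: show only the error/drop counters for the given uplinks.
# nic_errors is a hypothetical helper; pass the vmnics backing the vSAN LAG.

nic_errors() {
    for nic in "$@"; do
        echo "== $nic =="
        # Keep only counter lines that mention errors or drops
        esxcli network nic stats get -n "$nic" | grep -iE 'error|dropped'
    done
}

# Example with placeholder uplink names (uncomment and adjust):
# nic_errors vmnic2 vmnic3
```

Running it twice a few minutes apart and comparing the numbers is more telling than a single snapshot, since only counters that keep increasing point to an ongoing problem.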
It's not host specific; the issue occurs randomly with all hosts in the cluster. Please note that there is no issue with management, only with the vSAN/vMotion vmkernel ports.
3 nodes with 10G.
LAG with src/dst IP and TCP/UDP port load balancing.
LAG as the primary uplink for the dedicated port group.
Single vmk with a private IP on each host.
This is our setup.
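With a LAG-based setup like this, it may be worth capturing the vSAN vmknic configuration and the LACP state on each of the three hosts and comparing them side by side. A minimal sketch, assuming the LAG is configured as LACP on a distributed switch (the function name collect_vsan_net is just a placeholder):

```shell
#!/bin/sh
# Sketch: dump vSAN networking and LACP state on one host for comparison.
# collect_vsan_net is a hypothetical helper; it only runs read-only commands.

collect_vsan_net() {
    echo "### vSAN vmknic configuration"
    esxcli vsan network list
    echo "### vmkernel IPv4 addresses"
    esxcli network ip interface ipv4 get
    echo "### LACP status (distributed switch)"
    esxcli network vswitch dvs vmware lacp status get
}

# Example: save the output per host, then diff the files
# collect_vsan_net > /tmp/vsan-net-$(hostname).txt
```

If the LACP status differs between hosts, or an uplink is not bundled on one of them, that would be a reasonable place to start looking.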