We have a 2-node cluster at a remote facility. Each node has 2x 10Gb ports used for all traffic (VMkernel, VM, Mgmt, vMotion). Port A on each host goes to Switch A, and Port B on each host goes to Switch B. The switches were rebooted one at a time for switch code upgrades. During the process, all of the VMs were powered off for no apparent reason.
The vSAN port group uses both NICs set as Active, with the Route Based on Originating Virtual Port load-balancing policy and Failback set to No.
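That teaming setup can be confirmed from the hosts themselves; a minimal sketch using standard esxcli commands (run from an SSH session on each node; output layout varies by ESXi version):

```shell
# Which VMkernel interface(s) are tagged for vSAN traffic
esxcli vsan network list

# Physical NIC inventory and link state (confirms both 10Gb uplinks are up)
esxcli network nic list

# Distributed switch view, including uplinks and port configuration
esxcli network vswitch dvs vmware list
```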
Any ideas what happened here?
Did you get a chance to check the status of the hosts and VMs after rebooting the first switch? Are you able to bring the VMs up, and are they running fine now? What do the logs show? Are you seeing any HA-related events recorded?
Here is a snapshot of some of the events before one of the VMs was powered off. The events are listed in newest to oldest order.
VMNAME is powered off
Configuration file for VMNAME cannot be found
Renamed VMNAME from VMNAME to /vmfs/volumes/vsan:525cb476734a77d1-9e6dd64dfe3083b0/afa4c559-5e11-d9fe-b8ff-1402ec953578/DKGMPTL1.vmx
Renamed VMNAME from VMNAME to /vmfs/volumes/vsan:525cb476734a77d1-9e6dd64dfe3083b0/a697c559-a8a4-1e7c-34b7-1402ec953578/DKGM0600.vmx
Configuration file for VMNAME cannot be found
Renamed VMNAME from VMNAME to /vmfs/volumes/vsan:525cb476734a77d1-9e6dd64dfe3083b0/008dd359-c045-560f-6afa-1402ec953578/MSDC-P-G01.vmx
User root@127.0.0.1 logged out (login time: Tue Oct 10 06:58:54 EDT 2017, number of API invocations: 0, user agent: )
Host cannot communicate with one or more other nodes in the vSAN enabled cluster
Alarm 'Network uplink redundancy lost' on hostname changed from Green to Red
Alarm 'Network uplink redundancy lost' on hostname triggered an action
Alarm 'Network uplink redundancy lost': an SNMP trap for entity hostname was sent
vSphere HA agent is healthy
The vSphere HA availability state of this host has changed to Master
Task: Update vSAN configuration
The vSphere HA availability state of this host has changed to Election
The vSphere HA availability state of this host has changed to Unreachable
User root@127.0.0.1 logged in as
Device or filesystem with identifier bbdaaffe-b043b2ff has entered the All Paths Down state.
Lost uplink redundancy on DVPorts: "23/f1 ee 0c 50 9b d5 db ab-15 5d 33 4d 5f fa ab 69", "15/f1 ee 0c 50 9b d5 db ab-15 5d 33 4d 5f fa ab 69", "7/f1 ee 0c 50 9b d5 db ab-15 5d 33 4d 5f fa ab 69", "31/f1 ee 0c 50 9b d5 db ab-15 5d 33 4d 5f fa ab 69", "24/f1 ee 0c 50 9b d5 db ab-15 5d 33 4d 5f fa ab 69", "28/f1 ee 0c 50 9b d5 db ab-15 5d 33 4d 5f fa ab 69", "27/f1 ee 0c 50 9b d5 db ab-15 5d 33 4d 5f fa ab 69", "29/f1 ee 0c 50 9b d5 db ab-15 5d 33 4d 5f fa ab 69". Physical NIC vmnic4 is down.
Lost access to volume 59aef088-2dd9c2d1-a533-1402ec953578 (87f0ae59-8b60-0a7c-b1a4-1402ec953578) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
Lost access to volume 59b2f295-f868c64a-f1aa-1402ec953578 (94f2b259-791b-f447-e916-1402ec953578) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
Lost access to volume 59c597a6-2d735090-25ce-1402ec953578 (a697c559-a8a4-1e7c-34b7-1402ec953578) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
Lost access to volume 59c597a6-509b8aa4-3133-1402ec953578 (a697c559-9262-3aa0-d0ba-1402ec953578) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
Lost access to volume 59c5a4af-addb738a-32c7-1402ec953578 (afa4c559-5e11-d9fe-b8ff-1402ec953578) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
Lost access to volume 59d38d00-b7a4d91c-c8a7-1402ec953578 (008dd359-c045-560f-6afa-1402ec953578) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
Host cannot communicate with one or more other nodes in the vSAN enabled cluster
There were some connectivity issues during the switch reboots.
Under normal conditions, can you see traffic passing through both NICs? And did the issue happen on only a few VMs, or on all the VMs in the vSAN cluster?
Open an SSH session (e.g. PuTTY) to the host, run esxtop, and press 'n' to switch to the network view. Check whether you can see traffic passing through both vmnics.
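If you also want a record to review afterwards, the same host can capture counters non-interactively; a hedged sketch (sample count, interval, and output path are arbitrary choices, not requirements):

```shell
# Capture 12 samples at 5-second intervals in esxtop batch mode;
# the CSV includes per-vmnic receive/transmit rates
esxtop -b -n 12 -d 5 > /tmp/esxtop-net.csv

# Also check this host's view of vSAN cluster membership,
# to see whether a partition formed during the switch reboot
esxcli vsan cluster get
```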
Traffic is passing through both NICs