We have a 2-node cluster at a remote facility. Each node has 2x 10Gb ports used for all traffic (VMkernel, VM, Mgmt, vMotion). Port A on each host goes to Switch A, and Port B on each host goes to Switch B. The switches were rebooted one at a time for switch code upgrades. During the process, all of the VMs were powered off for no apparent reason.
The vSAN port group uses both NICs set as Active, with the Route Based on Originating Virtual Port load-balancing policy and Failback set to No.
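That teaming setup can be confirmed from the hosts themselves; a minimal sketch using standard esxcli commands (run from an SSH session on each node; output layout varies by ESXi version):

```shell
# Which VMkernel interface(s) are tagged for vSAN traffic
esxcli vsan network list

# Physical NIC inventory and link state (confirms both 10Gb uplinks are up)
esxcli network nic list

# Distributed switch view, including uplinks and port configuration
esxcli network vswitch dvs vmware list
```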
Any ideas what happened here?
Did you get a chance to check the status of the hosts and VMs after rebooting the first switch? Are you able to bring the VMs up, and are they running fine now? What do the logs show? Are you seeing any HA-related events recorded?
Here is a snapshot of some of the events before one of the VMs was powered off. The events are listed in newest to oldest order.
VMNAME is powered off
Configuration file for VMNAME cannot be found
Renamed VMNAME from VMNAME to /vmfs/volumes/vsan:525cb476734a77d1-9e6dd64dfe3083b0/afa4c559-5e11-d9fe-b8ff-1402ec953578/DKGMPTL1.vmx
Renamed VMNAME from VMNAME to /vmfs/volumes/vsan:525cb476734a77d1-9e6dd64dfe3083b0/a697c559-a8a4-1e7c-34b7-1402ec953578/DKGM0600.vmx
Configuration file for VMNAME cannot be found
Renamed VMNAME from VMNAME to /vmfs/volumes/vsan:525cb476734a77d1-9e6dd64dfe3083b0/008dd359-c045-560f-6afa-1402ec953578/MSDC-P-G01.vmx
User root@127.0.0.1 logged out (login time: Tue Oct 10 06:58:54 EDT 2017, number of API invocations: 0, user agent: )
Host cannot communicate with one or more other nodes in the vSAN enabled cluster
Alarm 'Network uplink redundancy lost' on hostname changed from Green to Red
Alarm 'Network uplink redundancy lost' on hostname triggered an action
Alarm 'Network uplink redundancy lost': an SNMP trap for entity hostname was sent
vSphere HA agent is healthy
The vSphere HA availability state of this host has changed to Master
Task: Update vSAN configuration
The vSphere HA availability state of this host has changed to Election
The vSphere HA availability state of this host has changed to Unreachable
User root@127.0.0.1 logged in as
Device or filesystem with identifier bbdaaffe-b043b2ff has entered the All Paths Down state.
Lost uplink redundancy on DVPorts: "23/f1 ee 0c 50 9b d5 db ab-15 5d 33 4d 5f fa ab 69", "15/f1 ee 0c 50 9b d5 db ab-15 5d 33 4d 5f fa ab 69", "7/f1 ee 0c 50 9b d5 db ab-15 5d 33 4d 5f fa ab 69", "31/f1 ee 0c 50 9b d5 db ab-15 5d 33 4d 5f fa ab 69", "24/f1 ee 0c 50 9b d5 db ab-15 5d 33 4d 5f fa ab 69", "28/f1 ee 0c 50 9b d5 db ab-15 5d 33 4d 5f fa ab 69", "27/f1 ee 0c 50 9b d5 db ab-15 5d 33 4d 5f fa ab 69", "29/f1 ee 0c 50 9b d5 db ab-15 5d 33 4d 5f fa ab 69". Physical NIC vmnic4 is down.
Lost access to volume 59aef088-2dd9c2d1-a533-1402ec953578 (87f0ae59-8b60-0a7c-b1a4-1402ec953578) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
Lost access to volume 59b2f295-f868c64a-f1aa-1402ec953578 (94f2b259-791b-f447-e916-1402ec953578) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
Lost access to volume 59c597a6-2d735090-25ce-1402ec953578 (a697c559-a8a4-1e7c-34b7-1402ec953578) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
Lost access to volume 59c597a6-509b8aa4-3133-1402ec953578 (a697c559-9262-3aa0-d0ba-1402ec953578) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
Lost access to volume 59c5a4af-addb738a-32c7-1402ec953578 (afa4c559-5e11-d9fe-b8ff-1402ec953578) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
Lost access to volume 59d38d00-b7a4d91c-c8a7-1402ec953578 (008dd359-c045-560f-6afa-1402ec953578) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
Host cannot communicate with one or more other nodes in the vSAN enabled cluster
There were some connectivity issues during the switch reboots.
Under normal conditions, can you see traffic passing through both NICs? And did the issue happen on only a few VMs, or on all the VMs in the vSAN cluster?
Open an SSH session (e.g. PuTTY) to the host, run esxtop, and press 'n' to switch to the network view. Check whether you can see traffic passing through both vmnics.
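If you also want a record to review afterwards, the same host can capture counters non-interactively; a hedged sketch (sample count, interval, and output path are arbitrary choices, not requirements):

```shell
# Capture 12 samples at 5-second intervals in esxtop batch mode;
# the CSV includes per-vmnic receive/transmit rates
esxtop -b -n 12 -d 5 > /tmp/esxtop-net.csv

# Also check this host's view of vSAN cluster membership,
# to see whether a partition formed during the switch reboot
esxcli vsan cluster get
```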
Traffic is passing through both NICs