DSeaman
Enthusiast

ESXi 5.0 U2 Froze when 1 SAN switch hiccuped

Environment: ESXi 5.0 U2, Cisco UCS blade servers, redundant Cisco MDS FC switches in separate fabrics, and a Fibre Channel storage array. The FC array is redundantly connected to the MDS switches, and ESXi is configured for active/active Round Robin path selection (per the array vendor's best practices).
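For anyone who wants to sanity-check a similar setup, the path policy and per-LUN path state can be verified from the host shell with standard esxcli commands (the naa ID below is just a placeholder):

    # Confirm each device's Path Selection Policy is VMW_PSP_RR (Round Robin)
    esxcli storage nmp device list
    # List the individual paths and their state for one LUN (placeholder device ID)
    esxcli storage core path list -d naa.60000000000000000000000000000001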

Today one of the MDS switches had a brain fart of some type (still investigating), so one of the two fabrics went down. The physical Windows Server 2012 UCS server with LUNs presented maintained I/O connectivity through the one operating switch. However, the dozen ESXi servers did not gracefully handle the fabric hiccup and basically froze. We could not connect via the vSphere Client, and the DCUI was unresponsive once you authenticated. VMs also froze, though they still responded to pings.

After realizing it was a fabric problem, we admin-shut the MDS uplinks to the UCS FI on the affected switch, and all of a sudden the ESXi hosts unfroze and started sending I/Os down the other (unaffected) fabric.
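For context, by "admin shut" I mean shutting the MDS ports facing the UCS FI from the switch CLI, roughly like the following (the interface numbers are placeholders for our actual uplinks):

    ! On the affected MDS, shut the FC uplinks toward the UCS FI
    conf t
     interface fc1/1-2
      shutdown
    end
    ! Verify the ports are down and the FI is no longer logged into the fabric
    show interface fc1/1
    show flogi database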

During the installation of the array a few months ago, we did thorough HA testing, including turning off SAN switches, pulling cables from the array, pulling UCS FI cables, etc., and saw no problems like what we had today. But that was with ESXi 5.0 U1.

Anyone have an idea why, with two valid paths to each LUN, ESXi froze? Only when the UCS FIs downed the server HBA ports did ESXi wake up and re-route the I/Os to the other fabric. The APD condition should not have been triggered, since two paths via the other fabric were 100% available.

Derek Seaman
4 Replies
Sreec
VMware Employee

Hi,

    You are right, the APD condition (http://kb.vmware.com/kb/1016626) won't trigger as long as at least one available path is still active.

"Could not connect via the vSphere client and the DCUI was unresponsive once you authenticated. VMs also froze, but could still ping" .It's pretty clear that hostd agent was hung!! At times i have seen DCUI also freezing out which happened in this case .I believe you have rebooted the host to recover from the scenario.Is there a chance where in you can check or upload the logs? If there are no dead paths most likely you will see SCSI reservation errors which will certainly freeze the hostd agent.

Cheers,
Sree | VCIX-5X | VCAP-5X | vExpert 7x | Cisco Certified Specialist
Please KUDO helpful posts and mark the thread as solved if answered
DSeaman
Enthusiast

The ESXi servers recovered without rebooting when we downed the MDS-to-UCS FC links on the bad switch. We haven't yet had time to comb through the logs, since it took a few hours to recover the environment and restore all services. I did open a support case with VMware, since in my mind the hosts should not freeze, and the fully functional physical Windows server is strong evidence that two good paths remained.

Derek Seaman
Sreec
VMware Employee

Hi,

    Yes, I completely agree with your findings. The good part is that we still have the logs available, since no reboot was required. If you would like them checked here, please do paste them (the vmkernel logs). I'm pretty sure VMware support will share the right information with you.
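If copying individual files around is a pain, a support bundle generated on the host will include the vmkernel logs as well:

    # Generate a full diagnostic bundle on the host (includes the vmkernel logs)
    vm-support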

Cheers,
Sree | VCIX-5X | VCAP-5X | vExpert 7x | Cisco Certified Specialist
Please KUDO helpful posts and mark the thread as solved if answered
DSeaman
Enthusiast

So VMware finally got back to me, and I'm not too reassured by their explanation of what happened. After looking through the logs for four hosts, the support engineer concluded that the SAN switch (Cisco MDS) did not send any SCSI sense code to the hosts when the device alias table was deleted on the switch (which wiped out the zoning WWNs and stopped I/Os). Even though two paths were still available to the LUNs, ESXi and hostd were waiting on I/Os from the black-hole path. There is no timeout for that particular wait (without a SCSI sense code), so ESXi hangs. Only when we admin-shut the FC interfaces to the UCS blades did ESXi get a SCSI sense code indicating something was down and use the other two active paths.
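For what it's worth, the state of the alias table and the active zoning on the MDS can be checked with standard NX-OS show commands (VSAN 10 below is just an example for our fabric):

    ! Check what is left of the device-alias table and the active zone set
    show device-alias database
    show zoneset active vsan 10
    show zone status vsan 10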

As I mentioned before, a physical Windows Server 2012 UCS blade gracefully handled the exact same situation without any noticeable I/O problems. So that tells me it is possible for an OS to detect such a condition, even without a SCSI sense code from the switch, and use the other two paths.

The short answer is that in this particular SAN switch failure scenario, ESXi will hang forever until it gets some other type of feedback that something is wrong. I really can't believe that is by design... shouldn't it have some type of I/O timer and fail over to the other two good paths??
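The only host-side timer I could find is the periodic path re-evaluation interval (Disk.PathEvalTime, 300 seconds by default), and as I understand it that only controls how often ESXi re-checks path state, not a timeout for in-flight I/O on a path that never responds:

    # Show the path re-evaluation interval (assumes the default Disk.PathEvalTime setting)
    esxcli system settings advanced list -o /Disk/PathEvalTime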

Derek Seaman