Extremely High Latency after Switches Brought Back Up (FCoE)
We recently had planned maintenance at a site where everything was gracefully shut down and then powered back on two days later. Our Cisco Nexus switches were brought up first, then our EMC VNX2, and finally our four VM hosts. Immediately, the VMs on one host were unresponsive and extremely slow to start. Looking in esxtop, I saw that DAVG and KAVG were extremely high, 5,000 ms to 30,000 ms, which is unacceptable. This normally points to storage, but the storage was working fine for the other hosts and for physical servers. This was only happening on one host. I rebooted that host and had the same issue.
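To make the esxtop reasoning above concrete, here is a minimal Python sketch that classifies per-adapter latency samples. It assumes the commonly cited rule-of-thumb thresholds (DAVG above roughly 25 ms suggesting a problem on the device or fabric side, KAVG above roughly 2 ms suggesting queuing inside the VMkernel) and the approximation GAVG ≈ DAVG + KAVG. The `LatencySample` type, the `classify` function, the thresholds, and the adapter names and values are all hypothetical and illustrative, not taken from any VMware tool.

```python
from dataclasses import dataclass

# Assumed rule-of-thumb thresholds (commonly cited esxtop guidance);
# these are not authoritative and should be tuned per environment.
DAVG_WARN_MS = 25.0  # latency from the HBA to the device and back
KAVG_WARN_MS = 2.0   # latency spent in the VMkernel (e.g. queuing)


@dataclass
class LatencySample:
    """One hypothetical esxtop adapter sample, in milliseconds per command."""
    adapter: str
    davg_ms: float
    kavg_ms: float

    @property
    def gavg_ms(self) -> float:
        # Guest-observed latency is approximately DAVG + KAVG.
        return self.davg_ms + self.kavg_ms


def classify(sample: LatencySample) -> str:
    """Return a rough hint about where latency is accumulating."""
    high_davg = sample.davg_ms > DAVG_WARN_MS
    high_kavg = sample.kavg_ms > KAVG_WARN_MS
    if high_davg and high_kavg:
        return "both"
    if high_davg:
        return "device"
    if high_kavg:
        return "kernel"
    return "ok"


if __name__ == "__main__":
    # Illustrative samples only; adapter names and values are made up.
    samples = [
        LatencySample("vmhba2", davg_ms=5000.0, kavg_ms=30000.0),
        LatencySample("vmhba3", davg_ms=3.0, kavg_ms=0.1),
    ]
    for s in samples:
        print(f"{s.adapter}: GAVG~{s.gavg_ms:.0f} ms -> {classify(s)}")
```

When both counters are high on only one host while other hosts on the same array look healthy, as described here, the hint points away from the array itself and toward that host's adapter or fabric path.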
Everything is using FCoE.
Physical servers attached to the same storage through the same Nexus switches were fine.
We use Dell M630s with Emulex CNAs.
The server having the issue was purchased about nine months ago and was brought online without problems at that time.
Two newer M630s worked fine; they did have an Emulex firmware revision one level higher.
I updated the Emulex firmware, and after the host rebooted it was fine. However, I don't think the firmware was the cause, since this host had previously been brought online with no issue. I suspect the update process reset the HBA, which resolved the problem.
The other thing is that this happened at our main site a year ago during a Nexus upgrade, on a mix of M620s and M630s, some with Emulex and some with Broadcom adapters. With EMC, Cisco, and VMware on the phone, we were not able to find the cause then either. Powering off all the VM hosts and bringing them back online resolved the issue.
I am using ESXi 5.5. Any thoughts on this, since it has now happened at two different sites and with different HBAs?