VMware Cloud Community
weestro
Contributor

Interesting problem with ESX35 and EVA8100

Environment:

2 x DL585 G1 hosts/ QLA HBAs

2 x Brocade 4100 FC switches (A/B fabrics)

1 x EVA 8100 (2 SPs, 4 ports per SP per fabric)

- 9 LUNs presented to each host, zoned across all SP ports in each fabric (8 total paths per LUN per server)

- Fixed pathing policy set, with each LUN's preferred path on a different SP port (the backups LUN shares a port with another tier-2 LUN)
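For reference, this is roughly how the path layout and Fixed-policy pinning can be inspected from the ESX 3.5 service console. The LUN/path IDs below are made-up examples (not from our environment), and the exact flags should be double-checked against `esxcfg-mpath -h` on the host:

```shell
# List every LUN with its paths and their states (on/dead).
# On a healthy host each of our VMFS LUNs should show 8 paths here.
esxcfg-mpath -l

# Set the Fixed policy on a LUN and pin its preferred path.
# vmhba1:0:3 is a hypothetical path name used for illustration.
esxcfg-mpath --lun=vmhba1:0:3 --policy=fixed
esxcfg-mpath --lun=vmhba1:0:3 --preferred --path=vmhba1:0:3
```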

So last week the FC switch for the A fabric panicked and rebooted itself in the middle of the day. (We're running an older firmware that was required for the HDS array we just migrated off of.) All active LUNs being accessed on SP-A went dead and those VMs died; all VMs living on LUNs accessed via SP-B (fabric B) kept running. Not knowing yet what the problem was, and in an effort to get the servers back up, the admins started bouncing the ESX hosts; the critical VMs were down anyway. On reboot the hosts would hang on the HBA scans for several minutes, unable to find the previously attached LUNs. Eventually the EVA SP rebooted itself, the LUNs came back with no corruption, and the VMs could be powered back on.

We have cases open with VMware, HP, and Brocade on this. Apparently the EVA SP created zombie paths that were never properly marked down, so ESX did not re-establish the connections on an alternate path/HBA. These paths showed as dead in ESX, and it eventually removed the LUNs from the config before the SP rebooted. After the SP reboot everything was fine again.

This exact scenario happened in my DR environment after an attempted FC switch firmware upgrade. The upgrade failed due to a bad image, but the switch was rebooted in a controlled fashion, and the EXACT same thing happened with the 8100 there. Only 2 LUNs were presented to my DR cluster, and the one being accessed via fabric A disappeared just like at our primary site. After a while the SP rebooted itself and the LUN was available again. All VMs living on that volume were down hard for the duration.

Anyone else seen any strangeness like this? We don't know yet if this is a VMware, switch, or array problem, but it is causing many here to start questioning the reliability of ESX (they're unaware of the backend infrastructure, of course). If ESX is causing the EVA SP to melt down and force a reboot, then HP has a serious problem. Either way this is no good and may cause us to move production off of ESX. :(

1 Reply
jharper1
Contributor

oops wrong thread...
