I have a problem that I hope someone can help me with.
In a nutshell, since we upgraded to vSphere ESX 4, we have run into the following problem.
When we remove a volume/LUN from our SAN, virtual machines sometimes temporarily lose their connection to the SAN, even though the datastores, RDMs and volumes being removed are no longer in use.
For example, if I delete a volume or LUN on our SAN that is no longer needed, making it unavailable to our ESX cluster, several of our VMs briefly lose connectivity to their datastores.
When this happens, the VMs appear to lose network connectivity for about 10 seconds on average. In reality, the VM, or the ESX host, cannot reach the SAN, which causes a temporary interruption at the OS level.
This seems to happen with both VMFS datastores and Raw Device Mapping (RDM) LUNs.
We did not have this problem on ESX 3.5.
Let me tell you about our environment.
IBM Blade Center
8 ESX hosts running vSphere ESX 4 (build 175625)
Over 80 virtual machines, mostly running Windows 2003.
The Blade Center is connected to a NetApp SAN via FCP (QLogic cards). The NetApp SAN is a FAS3160 (2 filers).
Running NetApp Host Utilities 5.1 on the ESX hosts and Windows VMs.
Last night I performed some tests and got the following results:
Environment - ESX12 isolated from the Blade Center and from the esx_all SAN initiator group.
Nshterm.office.local VM running on ESX12.
WA01 running normally in Blade Center.
Continuously pinging Nshterm & WA01.
Mapped & created 5 VMFS datastores labeled Test1-Test5 on ESX12.
Mapped 5 RDM LUNs to ESX12.
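For anyone who wants to repeat this kind of test, here is a minimal sketch of how the continuous pinging could be scripted so that each outage is counted automatically. It is not the tool I used; it assumes Python 3 and a Linux-style ping (-c and -W flags), and the function names ping_once, loss_runs and monitor are made up for illustration.

```python
import subprocess
import time
from itertools import groupby


def ping_once(host, timeout_s=1):
    """Send one ICMP echo request; return True if a reply came back.

    Assumption: a Linux-style ping (-c = count, -W = timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def loss_runs(replies):
    """Return the length of each run of consecutive lost pings."""
    return [len(list(group)) for ok, group in groupby(replies) if not ok]


def monitor(hosts, count, interval_s=1.0):
    """Ping every host `count` times and return the loss runs per host."""
    history = {host: [] for host in hosts}
    for _ in range(count):
        for host in hosts:
            history[host].append(ping_once(host))
        time.sleep(interval_s)
    return {host: loss_runs(replies) for host, replies in history.items()}


# Example (hypothetical call; WA01's full hostname is not given above):
# monitor(["nshterm.office.local", "wa01"], count=600)
```

A result such as [8] for one host and [] for the other would correspond to "lost 8 pings to NSHTERM, no ping loss to WA01".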
Test 1 - Destroyed Test1 VMFS volume from NetApp. No rescan of HBAs. Result - Immediately lost 8 pings to NSHTERM. No ping loss to WA01.
Test 2 - Destroyed Test10 RDM volume. No rescan. Result - No lost pings. Waited 5 minutes.
Test 3 - Deleted Test2 VMFS datastore first, then destroyed volume on NetApp. No rescan for 5 minutes. Result - No lost pings.
Test 4 - Deleted Test9 RDM volume. Immediate rescan of HBAs. Result - No lost pings. Waited 5 minutes.
Test 5 - Deleted Test3 datastore first, then destroyed volume from NetApp. Immediate rescan of HBAs. Result - No lost pings.
Test 6 - Destroyed Test4 VMFS volume from NetApp. No rescan. Result - No lost pings. Waited 5 minutes.
Test 7 - Destroyed remaining 3 RDM volumes. No rescan. Result - No lost pings. Waited 10 minutes.
Rescanned HBAs - The rescan took longer than normal. Right before the rescan completed, I lost 6 pings to NSHTERM. No ping loss to WA01.
Test 8 - Deleted Test5 datastore first, then destroyed Test5 VMFS volume from NetApp. No rescan. Result - Lost 2 pings. Waited 5 minutes.
So as you can see, we surely have an issue here. It does seem to be somewhat random, however.
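To make the randomness easier to see, here is a rough sketch that groups the eight tests by procedure and picks out the procedures that gave different results on different runs. The numbers are transcribed from the tests above. Two things are my own assumptions: a test that does not mention deleting the datastore first is recorded as not having done so, and the Test 7 rescan after the 10 minute wait is recorded as "delayed". The names Run, group_by_procedure and inconsistent_procedures are made up for illustration.

```python
from collections import defaultdict, namedtuple

Run = namedtuple("Run", "number lun_type datastore_deleted_first rescan lost_pings")

# lost_pings = pings lost to NSHTERM. datastore_deleted_first is None for RDMs.
# Assumption: tests that do not say "deleted datastore first" did not do so.
RESULTS = [
    Run(1, "VMFS", False, "none", 8),
    Run(2, "RDM", None, "none", 0),
    Run(3, "VMFS", True, "none", 0),
    Run(4, "RDM", None, "immediate", 0),
    Run(5, "VMFS", True, "immediate", 0),
    Run(6, "VMFS", False, "none", 0),
    Run(7, "RDM", None, "delayed", 6),  # pings were lost during the late rescan
    Run(8, "VMFS", True, "none", 2),
]


def group_by_procedure(results):
    """Map (lun_type, datastore_deleted_first, rescan) to [(test number, lost pings)]."""
    groups = defaultdict(list)
    for run in results:
        key = (run.lun_type, run.datastore_deleted_first, run.rescan)
        groups[key].append((run.number, run.lost_pings))
    return dict(groups)


def inconsistent_procedures(results):
    """Procedures where at least one run lost pings and at least one did not."""
    return {
        key: runs
        for key, runs in group_by_procedure(results).items()
        if any(lost > 0 for _, lost in runs) and any(lost == 0 for _, lost in runs)
    }


for procedure, runs in inconsistent_procedures(RESULTS).items():
    print(procedure, runs)
```

Under those assumptions, the same steps gave different outcomes in two cases: Tests 1 and 6 (VMFS volume destroyed, no rescan) and Tests 3 and 8 (datastore deleted first, no rescan).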
Any thoughts or suggestions? Has anyone run into this problem before?