VMware Cloud Community
Brad_Crossman
Contributor

ESX 4 - Problems with VMs temporarily losing connection to SAN after removing datastores.

I have a problem that I hope someone can help me with.

In a nutshell, since we upgraded to vSphere ESX 4, we have encountered the following problem.

When removing a volume/LUN from our SAN, Virtual Machines will sometimes temporarily lose their connection to the SAN. (The datastores, RDMs, and volumes being removed are no longer in use.)

For example, if I delete a volume or LUN on our SAN that is no longer needed, thus making it unavailable to our ESX cluster, several of our VMs will briefly lose connectivity to their datastores.

When this happens, the VMs appear to lose network connectivity for about 10 seconds on average... but in reality it is the VM, or the ESX host, that cannot reach the SAN, causing a temporary interruption at the OS level.

This seems to happen with both VMFS datastores and Raw Device Mapping LUNs.

We did not have this problem on ESX 3.5.

Let me tell you about our environment.

-


IBM Blade Center

8 ESX hosts running vSphere ESX 4, build 175625.

Over 80 Virtual Machines, mostly running Windows 2003.

The Blade Center is connected to a NetApp SAN via FCP (QLogic cards). The NetApp SAN is a FAS3160 (2 filers).

Running NetApp Host Utilities 5.1 on the ESX hosts & Windows VMs.

-


Last night I performed some tests and got the following results:

Environment - ESX12 isolated from the Blade Center and from the esx_all SAN initiator group.

Nshterm.office.local VM running on ESX12.

WA01 running normally in Blade Center.

Continuously pinging Nshterm & WA01.

Mapped & created 5 VMFS datastores labeled Test1-Test5 on ESX12.

Mapped 5 RDM LUNs to ESX12.
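For anyone repeating these tests, the continuous ping monitoring can be sketched as a small shell helper that runs a probe command repeatedly and counts the failures. The helper name is made up for illustration; the hostname in the example comment comes from this thread:

```shell
# count_failures (hypothetical helper): run a probe command N times and
# report how many runs failed (non-zero exit status).
count_failures() {
    probes="$1"; shift
    lost=0
    i=1
    while [ "$i" -le "$probes" ]; do
        "$@" >/dev/null 2>&1 || lost=$((lost + 1))
        i=$((i + 1))
    done
    echo "$lost"
}

# Example probe, as used in these tests:
#   count_failures 60 ping -c 1 -W 1 nshterm.office.local
```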

Test results:

Test 1 - Destroyed Test1 VMFS volume from NetApp. No rescan of HBAs. Result - Immediately lost 8 pings to NSHTERM. No ping loss to WA01.

Test 2 - Destroyed Test10 RDM volume. No rescan. Result - No lost pings. Waited 5 minutes.

Rescanned HBAs.

Test 3 - Deleted Test2 VMFS datastore first, then destroyed the volume on the NetApp. No rescan for 5 minutes. Result - No lost pings.

Rescanned HBAs.

Test 4 - Deleted Test9 RDM volume. Immediate rescan of HBAs. Result - No lost pings. Waited 5 minutes.

Test 5 - Deleted Test3 datastore first, then destroyed the volume from the NetApp. Immediate rescan of HBAs. Result - No lost pings.

Test 6 - Destroyed Test4 VMFS volume from NetApp. No rescan. Result - No lost pings. Waited 5 minutes.

Test 7 - Destroyed remaining 3 RDM volumes. No rescan. Result - No lost pings. Waited 10 minutes.

Rescanned HBAs - the rescan took longer than normal. Right before it completed, I lost 6 pings to NSHTERM. No ping loss to WA01.

Test 8 - Deleted Test5 datastore first, then destroyed Test5 VMFS volume from NetApp. No rescan. Result - Lost 2 pings. Waited 5 minutes.

-


So as you can see, we surely have an issue here. It does seem to be a little random, however.

Any thoughts or suggestions? Has anyone run into this problem before?

10 Replies
RParker
Immortal

One thing I have learned is that you have to set the parameters of the QLogic cards (and Emulex) to the fibre speed, and DON'T use automatic/auto-detect settings.

If you have 4Gb fibre, you set the QLogic BIOS on EACH and EVERY host (yes, it's a pain) to the speed at which they connect to the fabric/switches.

Some of the QLogic cards take a long time to rescan (driver problem), and during this time they drop their connections, which causes the VMs to drop.

So as an experiment, try putting one of the hosts in maintenance mode and rebooting that host. Use CTRL-Q to go into the QLogic HBA BIOS, enable the BIOS, and set the speed of the HBA to the speed of the port/switch (2Gb, 4Gb, etc.).

Then reboot; I think you will find your host will have better connectivity after that.

Also, FYI, you don't need to rescan on ESX 4.0; it will detect the missing LUNs (sometimes in less than a minute) and remove the LUN automatically, so manual rescans are not necessary.

Then try the rescan: not only should the rescan be faster, but performance will also be slightly better, and it should not drop your LUNs.

Also, how many hosts are connected to your LUNs? For shared LUNs, you should try to keep ALL simultaneous connections to 8 hosts or fewer.

Brad_Crossman
Contributor

We have 8 hosts connected to our LUNs.

On my test host I've changed the QLogic data rate speed from Auto Detect to 4Gb/s.

I will run my tests again tonight and let you know what happens.

RParker
Immortal

I will run my tests again tonight and let you know what happens.

OK, please do. And on ESX 4.0, after you remove the LUNs from the zone, don't do a manual rescan; it should remove the appropriate LUNs after just a few seconds...

markvor
Enthusiast

Do you have the latest BIOS on the blades and the controllers?

IBM released a new BIOS for vSphere support.

Best Regards

Markus Vorderer

Brad_Crossman
Contributor

BladeCenter HS21 XM

Type - 7995

Model: G6U

The current blade BIOS is version 1.12. I don't see any release notes that state that a BIOS upgrade supports vSphere.

The QLogic BIOS version for the blades is 1.24.

The QLogic switch firmware version is 4.04.09.

I also did not see any fixes for vSphere. If you can find them and show me, that would be great!

markvor
Enthusiast

I found that in the IBM Redbooks.

I'll try to find it.

Markus

Brad_Crossman
Contributor

Ok, so my initial tests did not cause a loss of connectivity to the SAN.

I created 2 VMFS datastores, without rescanning the HBAs.

Then I deleted the volumes from the NetApp, without deleting the datastores first.

After deleting the volumes from the NetApp, ESX did not remove the datastores, even after 10 minutes of waiting.

So I tried just refreshing, and this is where I ran into a problem.

Refreshing took a long time, and the test VM that I was pinging dropped 11 pings.

I then tried to add another datastore and noticed that all of the LUNs that had been deleted from our SAN were still showing up.

So I created another volume/LUN and presented it to our test ESX host. I tried creating another VMFS datastore; however, it took about 5 minutes, which isn't typical.

During that 5 minutes, pings to my test VM dropped on 3 different occasions.

The first time it dropped 11 pings, and the second and third times it dropped 8 pings each.

This is getting VERY frustrating. We never had this problem with ESX 3.5. I may be forced to roll back to ESX 3.5.

ATKElkton
Contributor

We're seeing a similar problem. I just removed a virtual machine and unassigned its storage, and one of my three hosts (the one that "owned" the VM) lost connectivity to Virtual Center (though I can still ping it). All the machines running on it lost ping for a bit... but are now responding, though the host is still inaccessible via VC and the web console.

However... I can iLO into it and it does respond to pings on the console address.

So... I ran: esxcfg-rescan -u vmhba1

It sat for a while, and my server came back.

I'm using QLogic mezzanine cards in BL680c G6 blades, talking to FalconStor NSS devices.

Just like Brad indicated... this was not a problem in ESX 3.5, but I think it was waaaaay back in 2.5.

I'd love to see a resolution - I have tons of storage to remove and now.... I'm scared!

Brad_Crossman
Contributor

I fixed the problem with help from someone other than NetApp and/or VMware support.

Here was the fix.

By default, ESX 4 uses an MPIO protocol called ALUA.

ALUA: Asymmetric Logical Unit Access.

ALUA is a relatively new multipathing technology for asymmetrical arrays. If the array is ALUA-compliant and the host multipathing layer is ALUA-aware, then virtually no additional configuration is required for proper path management by the host.

ALUA was NOT activated on our initiator groups (LUN masking) on the SAN.

I had to turn this on for all initiator groups on our SAN and set each ESX server's default storage MPIO policy to Round Robin. (esxcli nmp satp setdefaultpsp --satp VMW_SATP_ALUA --psp VMW_PSP_RR)

I also installed the latest NetApp Host Utilities 5.1; however, it was NOT needed. (But I installed it anyway because of the neat troubleshooting tools it comes with.)
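The two-part fix can be sketched as a dry run: a helper (the function name is made up here) that just prints the commands from this thread so they can be reviewed before running anything against the SAN or the hosts.

```shell
# Hypothetical dry-run helper: prints the fix commands from this thread
# instead of executing them. The igroup name "esx_all" comes from the
# original post; substitute your own initiator group names.
print_fix_plan() {
    cat <<'EOF'
# On each NetApp controller, enable ALUA on the ESX initiator group:
igroup set esx_all alua yes

# On each ESX host, make Round Robin the default PSP for the ALUA SATP,
# then VMotion the VMs off and reboot the host:
esxcli nmp satp setdefaultpsp --satp VMW_SATP_ALUA --psp VMW_PSP_RR
EOF
}

print_fix_plan
```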

This doc was helpful, and so was the information below:

http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=1010713&sl...

vSphere: Upgrading from non-ALUA to ALUA

Since vSphere provides ALUA support and enables Round-Robin I/O via the default PSP, here are the steps to migrate from a non-ALUA to an ALUA configuration and enable the Round-Robin algorithm using a NetApp disk array.

1) Make sure you're running a supported ONTAP version, such as 7.3.1 or later.

FAS2020A> version

NetApp Release 7.3.1.1: Mon Apr 20 22:58:46 PDT 2009

2) Enable the ALUA flag on the ESX igroups on each NetApp controller

FAS2020A> igroup show -v vmesx_b

vmesx_b (FCP):

OS Type: vmware

Member: 21:00:00:1b:32:10:27:3d (logged in on: vtic, 0b)

Member: 21:01:00:1b:32:30:27:3d (logged in on: vtic, 0a)

ALUA: No

FAS2020A> igroup set vmesx_b alua yes

FAS2020A> igroup show -v vmesx_b

vmesx_b (FCP):

OS Type: vmware

Member: 21:00:00:1b:32:10:27:3d (logged in on: vtic, 0b)

Member: 21:01:00:1b:32:30:27:3d (logged in on: vtic, 0a)

ALUA: Yes

3) VMotion the VMs to another host in the Cluster and reboot the ESX host

4) After the Reboot, the SATP will change to VMW_SATP_ALUA and the PSP to VMW_PSP_MRU.

5) You will need to change the PSP to VMW_PSP_RR. There are 2 options:

a) With the NetApp ESX Host Utilities Kit 5.1:

1) # /opt/netapp/santools/config_mpath -m -a CtlrA:username:password -a CtlrB:username:password

2) You will get a message to reboot the host

b) Manually:

1) # esxcli nmp satp setdefaultpsp --satp VMW_SATP_ALUA --psp VMW_PSP_RR

2) Reboot

6) On the ESX host, verify the new setting per device:

# esxcli nmp device

naa.60a9800050334b356b4a51312f417541
Device Display Name: NETAPP Fibre Channel Disk (naa.60a9800050334b356b4a51312f417541)
Storage Array Type: VMW_SATP_ALUA
Storage Array Type Device Config: {implicit_support=on;explicit_support=off;explicit_allow=on;alua_followover=on;{TPG_id=2,TPG_state=AO}{TPG_id=3,TPG_state=ANO}}
Path Selection Policy: VMW_PSP_RR
Path Selection Policy Device Config: {policy=rr,iops=1000,bytes=10485760,useANO=0;lastPathIndex=3: NumIOsPending=0,numBytesPending=0}
Working Paths: vmhba2:C0:T2:L1, vmhba1:C0:T2:L1
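Rather than eyeballing that output device by device, a sketch like the following can scan a saved copy of the `esxcli nmp device` report for any device whose PSP is not Round Robin. The helper name and the file path are made up for illustration.

```shell
# check_psp (hypothetical helper): print any "Path Selection Policy" line
# from a saved `esxcli nmp device` report that is NOT VMW_PSP_RR.
# Empty output means every device is already on Round Robin.
check_psp() {
    grep "Path Selection Policy:" "$1" | grep -v "VMW_PSP_RR"
}

# Usage sketch (path is illustrative):
#   esxcli nmp device > /tmp/nmp.txt && check_psp /tmp/nmp.txt
```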

elgordojimenez
Contributor

Hello,

We are seeing the same issues as you have described, with the only difference being that our storage array is a 3PAR. I am unsure if ALUA applies to 3PAR arrays, but what we have noticed is that with ESX 3.5 hosts no disconnects occur, while with ESX 4 U1 we see LUNs lose connectivity randomly, which causes the VMs to stop pinging.

Are you aware of anything for 3PAR storage arrays?

Cheers.

**** If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful ****