10 Replies Latest reply on May 31, 2010 9:09 AM by elgordojimenez

    ESX 4 - Problems with VM's temporarily losing connection to SAN after removing datastores.

    Brad_Crossman Novice

       

      I have a problem that I hope someone can help me with.

       

       

      In a nutshell, since we have upgraded to vSphere ESX 4, we have encountered the following problem.

       

       

      When removing a volume/LUN from our SAN, some Virtual Machines will temporarily lose their connection to the SAN. (The datastores, RDMs, and volumes being removed are no longer in use.)

       

       

      For example, if I delete a volume or LUN on our SAN that is no longer needed, making it unavailable to our ESX cluster, several of our VMs will briefly lose connectivity to their datastores.

       

       

      When this happens, the VMs appear to lose network connectivity for about 10 seconds on average... but in reality the VM, or the ESX host, cannot reach the SAN, which causes a temporary interruption at the OS level.

       

       

      This seems to happen with both VMFS datastores and Raw Device Mapping LUNs.

       

       

      We did not have this problem on ESX 3.5.

       

       

      Let me tell you about our environment.

       

       

      -


       

       

      IBM Blade Center

       

       

      8 ESX hosts running vSphere ESX 4, build 175625

       

       

      Over 80 Virtual Machines, mostly running Windows 2003.

       

       

      The BladeCenter is connected to the NetApp SAN via FCP (QLogic cards).  The NetApp SAN is a FAS3160 (2 filers).

       

       

      Running Netapp Host Utilities 5.1 on ESX Hosts & Windows VM's.

       

       

      -


       

       

      Last night I performed some tests and got the following results:

       

       

      Environment - ESX12 isolated from the Blade Center and from the esx_all SAN initiator group.

       

       

      Nshterm.office.local VM running on ESX12.

       

       

      WA01 running normally in Blade Center.

       

       

      Continuously pinging Nshterm & WA01.

       

       

      Mapped & created 5 VMFS datastores labeled Test1-Test5 on ESX12.

       

       

      Mapped 5 RDM LUN's to ESX12.
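
      (For anyone reproducing this setup, the NetApp side of creating and mapping these test LUNs looks roughly like the following on the 7-mode CLI. The volume, LUN and initiator group names here are placeholders, not our real ones.)

      FAS> vol create testvol aggr0 100g
      FAS> lun create -s 15g -t vmware /vol/testvol/test1
      FAS> lun map /vol/testvol/test1 esx12_test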

       

       

      Test results:

       

       

      Test 1 - Destroyed Test1 VMFS volume from Netapp. No rescan of HBA. Result - Immediately lost 8 pings to NSHTERM. No ping loss to WA01.

       

       

      Test 2 - Destroyed Test10 RDM volume.  No rescan. Result - No lost pings.  Waited 5 minutes.

       

       

      Rescanned HBA's.

       

       

      Test 3 - Deleted Test2 VMFS datastore first, then destroyed volume on Netapp.  No rescan for 5 minutes.  Result - No lost pings

       

       

      Rescanned HBA's.

       

       

      Test 4 - Deleted Test9 RDM volume.  Immediate rescan of HBA's.  Result - No lost pings.  Waited 5 minutes.

       

       

      Test 5 - Deleted Test3 datastore first, then destroyed the volume from the Netapp.  Immediate rescan of HBA's.  Result - No lost pings.

       

       

      Test 6 - Destroyed Test4 VMFS volume from Netapp. No rescan.  Result - No lost pings.  Waited 5 minutes.

       

       

      Test 7 - Destroyed remaining 3 RDM volumes. No rescan. Result - No lost pings.  Waited 10 minutes.

       

       

      Rescanned HBA's - Rescanning of HBA's took longer than normal.  Right before the rescan completed, I lost 6 pings to NSHTERM. No ping loss to WA01.

       

       

      Test 8 - Deleted Test5 datastore first, then destroyed the Test5 VMFS volume from the Netapp.  No rescan.  Result - Lost 2 pings.  Waited 5 minutes.

       

       

      -


       

       

      So as you can see, we surely have an issue here.  It does seem to be a little random, however.

       

       

      Any thoughts or suggestions?  Has anyone run into this problem before?

       

       

        • 1. Re: ESX 4 - Problems with VM's temporarily losing connection to SAN after removing datastores.
          RParker Guru

          One thing I have learned is that you have to set the parameters of the QLogic cards (and Emulex) to the correct fibre speed, and DON'T use automatic / auto-detect settings.

           

          If you have 4Gb fibre, you set the QLogic BIOS on EACH and EVERY host (yes, it's a pain) to the speed at which they are connecting to the fabric / switches.

           

          Some of the QLogic cards take a long time to rescan (a driver problem), and during this time they drop their connections, which causes the VMs to drop.

           

          So as an experiment, try putting one of the hosts in maintenance mode and reboot that host.  Use CTRL-Q to go into the QLogic HBA BIOS, enable the BIOS, and set the speed of the HBA to the speed of the port / switch (2Gb, 4Gb, etc.).

           

          Then reboot afterwards; I think you will find your host has better connectivity after that.

           

          Also, FYI: you don't need to rescan on ESX 4.0.  It will detect the missing LUNs (sometimes in less than a minute) and remove them automatically, so manual rescans are not necessary.

           

          Then try the rescan; not only should the rescan be faster, but performance will be slightly better, and it should not drop your LUNs.
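
          For reference, if you do end up rescanning manually, from the ESX 4 service console it's something like this (vmhba1 is just an example adapter name, so list yours first):

          # esxcfg-scsidevs -a     <- list the HBAs the host sees
          # esxcfg-rescan vmhba1   <- rescan a single HBA
          # esxcfg-scsidevs -m     <- list the VMFS volumes / LUN mappings after the rescan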

           

          Also, how many hosts are connected to your LUNs?  For shared LUNs you should try to keep ALL simultaneous connections to 8 hosts or fewer.

          • 2. Re: ESX 4 - Problems with VM's temporarily losing connection to SAN after removing datastores.
            Brad_Crossman Novice

             

            We have 8 hosts connected to our LUN's.

             

             

            On my test host I've changed the QLogic data rate speed from Auto Detect to 4Gb/s.

             

             

            I will run my tests again tonight and let you know what happens.

             

             

            • 3. Re: ESX 4 - Problems with VM's temporarily losing connection to SAN after removing datastores.
              RParker Guru

              I will run my tests again tonight and let you know what happens.

               

              OK, please do.  And for ESX 4.0, after you remove the LUNs from the zone, don't do a manual rescan; it should remove the appropriate LUNs after just a few seconds...

              • 4. Re: ESX 4 - Problems with VM's temporarily losing connection to SAN after removing datastores.
                markvor Enthusiast

                Do you have the current BIOS on the blades and on the controllers?

                IBM released a new BIOS for vSphere support.

                 

                Best Regards

                 

                Markus Vorderer

                • 5. Re: ESX 4 - Problems with VM's temporarily losing connection to SAN after removing datastores.
                  Brad_Crossman Novice

                   

                  BladeCenter HS21 XM

                   

                   

                  Type - 7995

                   

                   

                  Model: G6U

                   

                   

                  The current blade BIOS is version 1.12.  I don't see any release notes stating that a BIOS upgrade adds vSphere support.

                   

                   

                  The QLogic BIOS version for the blades is 1.24.

                   

                   

                  The QLogic switch firmware version is 4.04.09.

                   

                   

                  I also did not see any fixes for vSphere.  If you can find them and show me, that would be great!

                   

                   

                  • 6. Re: ESX 4 - Problems with VM's temporarily losing connection to SAN after removing datastores.
                    markvor Enthusiast

                    I found that in the IBM Redbooks.

                     

                    I'll try to find it.

                     

                    Markus

                    • 7. Re: ESX 4 - Problems with VM's temporarily losing connection to SAN after removing datastores.
                      Brad_Crossman Novice

                       

                      OK, so my initial tests did not cause a loss of connectivity to the SAN.

                       

                       

                      I created 2 VMFS datastores, without rescanning the HBA's.

                       

                       

                      Then I deleted the volumes from the Netapp, without deleting the datastores first.

                       

                       

                      After deleting the volumes from the Netapp, ESX did not remove the datastores, even after 10 minutes of waiting.

                       

                       

                      So I tried just refreshing, and this is where I ran into a problem.

                       

                       

                      Refreshing took a long time, and the test VM that I was pinging dropped 11 pings.

                       

                       

                      I then tried to add another datastore and noticed that all of the LUNs that had been deleted from our SAN were still showing up.

                       

                       

                      So I created another volume/LUN and presented it to our test ESX host.  I tried creating another VMFS datastore; however, it took about 5 minutes, which isn't typical.

                       

                       

                      During those 5 minutes, pings to my test VM dropped on 3 different occasions.

                       

                       

                      The first time, it dropped 11 pings.

                       

                       

                      The second and third times, it dropped 8 pings each.

                       

                       

                      This is getting VERY frustrating.  We never had this problem with ESX 3.5.  I may be forced to roll back to ESX 3.5.

                       

                       

                       

                       

                       

                      • 8. Re: ESX 4 - Problems with VM's temporarily losing connection to SAN after removing datastores.
                        ATKElkton Lurker

                         

                        We're seeing a similar problem. I just removed a virtual machine and unassigned the storage, and one of my three hosts (the one that "owned" the VM) lost connectivity to Virtual Center (though I can still ping it). All the machines running on it lost ping for a bit... but they are now responding, though the host is still inaccessible via VC and the web console.

                         

                         

                        However... I can iLO into it and it does respond to pings on the console address.

                         

                         

                        So....   I ran esxcfg-rescan -u vmhba1

                         

                         

                        It sat for a while, and my server came back.

                         

                         

                        I'm using QLogic mezzanine cards in BL680c G6 blades, talking to FalconStor NSS devices.

                         

                         

                        Just like Brad indicated... this was not a problem in ESX 3.5 but I think it was waaaaay back in 2.5.

                         

                         

                        I'd love to see a resolution - I have tons of storage to remove and now.... I'm scared!

                         

                         

                         

                         

                         

                         

                         

                         

                        • 9. Re: ESX 4 - Problems with VM's temporarily losing connection to SAN after removing datastores.
                          Brad_Crossman Novice

                           

                          I fixed the problem with the help of someone other than Netapp and/or VMware support.

                           

                           

                          Here was the fix.

                           

                           

                          ESX 4 has built-in support for an MPIO standard called ALUA.

                           

                           

                          ALUA: Asymmetric Logical Unit Access.

                           

                           

                          ALUA is a relatively new multipathing technology for asymmetrical arrays. If the array is ALUA compliant and the host multipathing layer is ALUA aware then virtually no additional configuration is required for proper path management by the host.

                           

                           

                          ALUA was NOT activated on our Initiator groups (LUN Masking) on the SAN.

                           

                           

                          I had to turn this on for all initiator groups on our SAN and set each ESX server's default storage MPIO policy to Round Robin. (esxcli nmp satp setdefaultpsp --satp VMW_SATP_ALUA --psp VMW_PSP_RR)
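
                          To verify the change afterwards, I believe these two commands on the ESX host will show the default PSP per SATP and what each device is actually using (ESX 4 service console syntax):

                          # esxcli nmp satp list      <- each SATP with its default PSP
                          # esxcli nmp device list    <- the SATP and PSP claiming each device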

                           

                           

                          I also installed the latest NetApp Host Utilities 5.1; however, it was NOT needed. (But I installed it anyway because of the neat troubleshooting tools it comes with.)

                           

                           

                          This doc was helpful, and so was the information below:

                           

                           

                          http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=1010713&sliceId=1&docTypeID=DT_KB_1_1&dialogID=36736592&stateId=0 0 34725396

                           

                          vSphere: Upgrading from non-ALUA to ALUA

                           

                          Since vSphere provides ALUA support and enables Round-Robin I/O via the default PSP, here are the steps to migrate from a non-ALUA to an ALUA configuration and enable the Round-Robin algorithm using a NetApp disk array.

                           

                           

                          1) Make sure you're running a supported ONTAP version, such as any version above 7.3.1:

                           

                           

                          FAS2020A> version

                          NetApp Release 7.3.1.1: Mon Apr 20 22:58:46 PDT 2009

                           

                           

                          2) Enable the ALUA flag on the ESX igroups on each NetApp controller

                           

                           

                          FAS2020A> igroup show -v vmesx_b

                              vmesx_b (FCP):

                                  OS Type: vmware

                                  Member: 21:00:00:1b:32:10:27:3d (logged in on: vtic, 0b)

                                  Member: 21:01:00:1b:32:30:27:3d (logged in on: vtic, 0a)

                                  ALUA: No

                           

                           

                          FAS2020A> igroup set vmesx_b alua yes

                           

                           

                           

                           

                           

                          FAS2020A> igroup show -v vmesx_b

                              vmesx_b (FCP):

                                  OS Type: vmware

                                  Member: 21:00:00:1b:32:10:27:3d (logged in on: vtic, 0b)

                                  Member: 21:01:00:1b:32:30:27:3d (logged in on: vtic, 0a)

                                  ALUA: Yes

                           

                           

                          3) VMotion the VMs to another host in the Cluster and reboot the ESX host

                           

                           

                          4) After the Reboot, the SATP will change to VMW_SATP_ALUA and the PSP to VMW_PSP_MRU.

                           

                          5) You will need to change the PSP to VMW_PSP_RR. There are 2 options:

                          a) With the NetApp ESX Host Utilities Kit 5.1:

                             # /opt/netapp/santools/config_mpath -m -a CtlrA:username:password -a CtlrB:username:password

                             You will get a message to reboot the host.

                          b) Manually:

                             # esxcli nmp satp setdefaultpsp --satp VMW_SATP_ALUA --psp VMW_PSP_RR

                             Then reboot.

                          6) On the ESX host, verify the new setting per device:

                          # esxcli nmp device list

                          naa.60a9800050334b356b4a51312f417541
                          Device Display Name: NETAPP Fibre Channel Disk (naa.60a9800050334b356b4a51312f417541)
                          Storage Array Type: VMW_SATP_ALUA
                          Storage Array Type Device Config: {implicit_support=on;explicit_support=off;explicit_allow=on;alua_followover=on;{TPG_id=2,TPG_state=AO}{TPG_id=3,TPG_state=ANO}}
                          Path Selection Policy: VMW_PSP_RR
                          Path Selection Policy Device Config: {policy=rr,iops=1000,bytes=10485760,useANO=0;lastPathIndex=3: NumIOsPending=0,numBytesPending=0}
                          Working Paths: vmhba2:C0:T2:L1, vmhba1:C0:T2:L1
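
                          As a side note, I believe the PSP can also be set per device rather than changing the host-wide default, e.g. using the example device ID from the output above:

                          # esxcli nmp device setpolicy --device naa.60a9800050334b356b4a51312f417541 --psp VMW_PSP_RR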

                           

                          • 10. Re: ESX 4 - Problems with VM's temporarily losing connection to SAN after removing datastores.
                            elgordojimenez Enthusiast

                             

                            Hello,

                             

                             

                             

                            We are seeing the same issues as you have described, with the only difference being that our storage array is 3PAR. I am not sure whether ALUA applies to 3PAR arrays, but what we have noticed is that with ESX 3.5 hosts no disconnects occur, while with ESX 4 U1 we see LUNs lose connectivity randomly, which causes the VMs to stop pinging.

                             

                             

                            Are you aware of anything for 3PAR storage arrays?

                             

                             

                             

                             

                             

                            Cheers.