murphyslaw1978b
Contributor

ESXi 5 hosts take hours to boot when LUNs have SCSI reservation

I have 2 IBM HS22 blades running a Win2k3 SQL 2005 cluster on ESXi 4.0.  I'm putting in 2 new HX5 blades running ESXi 5.0.  When none of the shared LUNs are RDM-presented to the hosts, they boot in 8 minutes (normal).  When I present the LUNs that the SQL cluster is using, my hosts take 2-4 hours to boot.  If I press Alt-F12 on the console screen to see what's happening, I can clearly see that the host cannot access a LUN due to a SCSI reservation, and it keeps retrying multiple times.  In total, there are 24 LUNs presented as raw disks, which is required for the SQL cluster.

In summary, I'd love to get rid of the MS cluster and shared disk, but I haven't found an easy way.  So I need to make sure that I can migrate confidently to the new blades without a 2-4 hour boot time issue.  Any ideas?

Virtualinfra
Commander

Welcome to the community.

Check whether VAAI is supported for the storage you're using, and enable it on the ESXi hosts, which might help free the SCSI reservations.

Refer below for more information:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=102197...
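
To quickly check this on an ESXi 5 host, something like the following should work (a sketch; naa.xxx is a placeholder for one of your device IDs):

# Per-device VAAI/ATS support status
esxcli storage core device vaai status get -d naa.xxx

# Confirm hardware-assisted locking (ATS) is enabled host-wide
esxcli system settings advanced list -o /VMFS3/HardwareAcceleratedLocking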

Award points for helpful and correct answers by clicking the tab below. :)

Thanks & Regards, Dharshan S (VCP 4.0, VTSP 5.0, VCP 5.0)
murphyslaw1978b
Contributor

I checked to see if VAAI is enabled on the ESXi 5 hosts, and it is.  I've verified that my IBM SVC running 6.2 code is on the HCL.  But on the ESXi 4.0 U0 hosts, I don't see the 2 settings at all.  Isn't that because VAAI isn't available in 4.0, only in 4.1 and later?
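
For reference, on 4.1 and later the VAAI advanced settings can be read like this (a sketch; these options simply don't exist on 4.0, which would explain not seeing them):

esxcfg-advcfg -g /DataMover/HardwareAcceleratedMove    # full copy (XCOPY)
esxcfg-advcfg -g /DataMover/HardwareAcceleratedInit    # block zeroing
esxcfg-advcfg -g /VMFS3/HardwareAcceleratedLocking     # ATS locking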

Regardless, I'm now starting to think that the problem is indeed with the ESXi 5 hosts.  For example, using the SVC, I cloned the production LUNs and presented them to just my 2 ESXi hosts.  I tried rescanning the datastores, and it takes considerably longer when the SCSI reservations are in use.  Rebooting a host also takes longer when they are in use.  So I think something similar is going on with the ESXi 5 hosts on iSCSI LUNs as on Fibre Channel LUNs.

a_p_
Leadership

Which patch/build of ESXi 5.0 are you currently running on the hosts? Please take a look at http://kb.vmware.com/kb/2007108 to see whether this applies to your issue.

André

murphyslaw1978b
Contributor

Yeah, I guess I could try it.  I'm running FC, not iSCSI, but it may be worth a try anyway.

rlund
Enthusiast

My apologies if this has been asked, but are you fully patched on ESXi 5? Sounds a little like the UNMAP bug... Is this EMC storage?

Roger Lund Minnesota VMUG leader Blogger VMware and IT Evangelist My Blog: http://itblog.rogerlund.net & http://www.vbrainstorm.com
kastlr
Expert

Hi,

This is a known VMware issue; check the following article.

ESX/ESXi hosts hosting passive MSCS nodes with RDM LUNs may take a long time to boot
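
In a nutshell, the article's fix is to mark each MSCS RDM LUN as perennially reserved so the host skips the reservation retries during its boot-time storage scan. A sketch (run once per device, on every host that sees these LUNs; naa.xxx is a placeholder for your device ID):

esxcli storage core device setconfig -d naa.xxx --perennially-reserved=true

The setting should persist across reboots, so it only needs to be applied once per host.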

Regards

Ralf


Hope this helps a bit.
Greetings from Germany. (CEST)
murphyslaw1978b
Contributor

Hmm, I tried following the article, but a reboot is still taking hours (23 RDMs).  Not sure if I did it right, but an "esxcli storage core device list -d naa.xxx" command shows that the setting took effect.  I'm thinking something else is going on, since the error messages I saw when booting up were not the same as those listed in the article.
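
For anyone checking the same thing, the flag can be verified per device like so (a sketch; naa.xxx is a placeholder):

esxcli storage core device list -d naa.xxx | grep -i "Perennially Reserved"
# Expected output once the setting is applied:
#    Is Perennially Reserved: true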

Also, I have not patched the ESXi 5 host at all - it's the original canned image.  I did not see a patch for this specific issue, but perhaps there is a patch that may help.  Is there a list of patches that I can apply in their entirety?  I might have to install Update Manager to get all the available patches deployed quickly and easily.  Not sure if there is any other way to do it, or if I can download an ISO that has all the patches and fixes rolled up.  Without Update Manager, applying multiple patches is nearly as much work as simply reinstalling.
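
Without Update Manager, one option is to apply an offline patch bundle from the ESXi shell (a sketch; the bundle path and name are placeholders for whatever you download from the VMware patch portal):

# Put the host in maintenance mode first, then:
esxcli software vib update -d /vmfs/volumes/datastore1/ESXi500-patch-bundle.zip
# Reboot the host afterwards for the update to take effect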

kastlr
Expert

Hi,

it might be helpful to see some more information about the messages you got during the host reboot.

Another thing to keep in mind is that ALL ESX servers should see the pRDMs with identical LUN IDs.

Not 100% sure if this requirement still applies with ESXi 5, but it was a requirement with 4.x.
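
One way to compare the LUN IDs across hosts (a sketch; naa.xxx is a placeholder, run on each host and compare the numbers):

esxcli storage core path list -d naa.xxx | grep "LUN:"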

Hth

Ralf


Hope this helps a bit.
Greetings from Germany. (CEST)
murphyslaw1978b
Contributor

I did an interesting test: I cloned the production LUNs and am now booting off a different SAN, with just one of the blades and one of the VM cluster nodes running.  The VM is incredibly slow, and attached is what's on the Alt-F12 screen on the ESXi console.  I'm starting to think that the issue is an HBA firmware or driver issue, or perhaps a code issue with the IBM SVC.
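
To check which HBA driver and version a host is using (a sketch; lpfc820 is just an example module name, substitute whatever esxcfg-scsidevs reports):

esxcfg-scsidevs -a                         # lists HBAs and their driver modules
vmkload_mod -s lpfc820 | grep -i version   # shows the driver module version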

Yes, I completely agree that the RDM LUNs need to have the exact same SCSI IDs and be in the same order on the other guest (as well as on the host, actually).  Now, the good news is that all the disks in the cluster are coming online.  So I know I've got them mapped correctly (with the SCSI controller set to physical bus sharing as well as the RDMs set to physical mode).

I also decided to boot up the other host, and booting is taking a long time again.  The 2nd screenshot shows the error on the 2nd node's screen.
