VMware Cloud Community
joshuacneal
Contributor

After updating to ESXi 4.1, ESXi boot up hangs on "cbt loaded successfully"

I recently started having problems with my ESXi 4.0 server. After trying to reset the management tools and rebooting, I decided to reset the system configuration. Once I had redone all my network settings I was able to connect with the vSphere Client, but when I attempted to browse the datastore it would take forever and the "...." would just keep growing. Eventually it would display the top-level folders in my datastore, but I could not browse any deeper; I would get a message saying the server took too long to respond.

So in my infinite wisdom I decided this would be a good time to upgrade to ESXi 4.1 and see if that resolved the issue. Well it did not.

I ran the update using the VMware CLI, and after a while it reported that the update had applied successfully but that the server needed a reboot. After rebooting, the server now hangs during boot: it gets to the line that says "cbt loaded successfully", then just stops. I have let it run overnight, to no avail. I assigned a static IP to this server and I can ping that IP, but I cannot connect using the CLI or anything else. Any help is greatly appreciated!

Please let me know any other info anyone might need to help me resolve this, and please bear with me I am still somewhat of a newbie to ESXi.

13 Replies
golddiggie
Champion

What's your host hardware? Have you run any tests on the drives that ESXi resides on? How about running memtest86+? I would let it run at least overnight, or for 24 hours, to see if anything comes up. I ran it on three new servers that were to become ESX hosts before installing ESX; it stressed the system enough that one of the memory modules started reporting errors (it was destined to fail, this just made it happen before we installed ESX onto it).

I would also run any bootable diagnostics you can download from the hardware manufacturer (from a CD/DVD or even a USB flash drive). You could have a drive in the host starting to fail, or the RAID card could be having issues; there are too many possibilities without running these tests on the host server, unless you get an error code the manufacturer recognizes and has a resolution for. You could also try updating the server's BIOS if any updates are available.

Network Administrator

VMware VCP4

Consider awarding points for "helpful" and/or "correct" answers.

Ruud_van_Strijp
Contributor

I had this same error, after upgrading from a fully functional and operational ESXi 4.0 U2 to ESXi 4.1, using the vihostupdate.pl script from the CLI.

ESXi is installed on a 750 GB disk, and I have a Promise SuperTrak 8654 with 4 disks for the VMs. ESXi kept stalling at 'cbt loaded successfully'. But after I unplugged the SAS cable from my SuperTrak card, ESXi 4.1 started without any issues, so I assume it has some driver issue with that card.

Maybe you can unplug some disks as well, if your ESXi is installed on a different disk. At least this way you can boot into VMware and maybe upgrade drivers. That is at least what I'm going to try now.

Edit: One of my drives turned out to be failing. I removed it from the array and ESXi booted without any issues. :)

TAE
Contributor

While this is not a "helpful" response, I want to let you know that you're not alone. I have a collection of new R710 servers connected to back-end NetApp disk via Fibre Channel HBAs. The servers I initially stood up run ESX 4.0 U1 and run fine. I took three servers that had yet to be moved to "production" (they weren't hosting any VMs, but were part of the cluster) and upgraded them to ESXi 4.1 (read: installed fresh).

These newly built servers now take 60 minutes to boot. They first hang at "vmw_vaaip_netapp loaded successfully" and then again at "cbt loaded successfully". I've started a discussion on this topic here (http://communities.vmware.com/thread/279042?tstart=30) with no response yet. I was beginning to worry that I was the only one seeing this sort of thing, until I saw your post.

What's even more interesting is that it is only happening to new ESXi 4.1 hosts that have disk provisioned (visible) to them. I have other hosts that boot fine (quickly, even), but they don't have disk provisioned to them yet.

Eventually the hosts boot up, but since it's taking 60 minutes, I'm of the opinion that Something Ain't Right (TM).

-Todd

ilion1
Contributor

Hello,

I just ran into a similar issue: everything after "CBT loaded successfully" was very slow. After I rebooted the NAS system providing the iSCSI storage, ESXi host startup was normal again, but on the next reboot the ESXi host would not reconnect to the storage, even though the storage system showed an active connection from the host. Eventually I noticed that the iSCSI connection stays active even when the ESXi host is shut down or rebooted, so after a reboot the host times out trying to reconnect. That is what causes the delay at "CBT loaded successfully" and everything after it. Whenever I manually disconnect the connection on the storage system after the host shuts down, the next start is fine.

To me it looks like a bug in the ESXi 4.1 shutdown process: it does not disconnect from iSCSI storage properly.

robert_eckdale
Enthusiast

Similar situation.

ESXi 4.1 (installed on an internal USB key) running on HP BL490c blades connected to an HP EVA 8400. Several of the ESXi hosts are running MSCS VMs (cluster across boxes) using RDM LUNs. The MSCS clusters are a mix of 2003 R2 and 2008 R2, but each ESXi host runs only one or the other; i.e., no host has both a 2003 and a 2008 R2 MSCS VM.

  • A reboot of a host with no RDM LUNs takes approximately 1:30.

  • A reboot of a host with MSCS (2008 R2) RDM LUNs takes the same amount of time.

  • A reboot of a host with MSCS (2003 R2) RDM LUNs takes 15+ minutes, with a very long pause on 'vmw_satp_eva loaded'.

I ran a quick test on the same host; here are the results:

Test 1: 15 LUNs total, 6 RDM, 6 MSCS (2003 R2) RDM; boot time 18:10

Test 2: 10 LUNs total, 0 RDM, 0 MSCS (2003 R2) RDM; boot time 1:25

Test 3: 11 LUNs total, 1 RDM, 1 MSCS (2003 R2) RDM; boot time 5:42
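Interpolating from these three runs (my own back-of-the-envelope arithmetic, not anything from VMware), each MSCS 2003 R2 RDM LUN appears to add roughly 2.5 to 4.5 minutes over the no-RDM baseline:

```python
# Boot times from the three tests above, converted to seconds.
def mmss(s: str) -> int:
    """Parse an mm:ss boot time into seconds."""
    m, sec = s.split(":")
    return int(m) * 60 + int(sec)

baseline = mmss("1:25")    # Test 2: 0 MSCS RDM LUNs
six_luns = mmss("18:10")   # Test 1: 6 MSCS RDM LUNs
one_lun = mmss("5:42")     # Test 3: 1 MSCS RDM LUN

# Extra boot time attributable to each RDM LUN in the two RDM runs.
per_lun_from_six = (six_luns - baseline) / 6
per_lun_from_one = (one_lun - baseline) / 1
print(per_lun_from_six, per_lun_from_one)  # -> 167.5 257
```

The two estimates don't agree exactly (about 167 s vs. 257 s per LUN), so the delay isn't perfectly linear in LUN count, but both clearly dwarf the 85-second baseline boot.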

Chopper3
Contributor

Seeing EXACTLY the same behaviour!

golddiggie
Champion

What's the physical server hardware you're running ESXi 4.1 on? I never saw the OP respond with this info, so I suspect the host is not on the VMware HCL for ESX/ESXi 4.1. It would also be good to know how you went about updating the host. Give as much detail as possible about your hardware (make/model of server, CPUs, RAM, storage, etc.).


TAE
Contributor

FWIW, this issue is documented in the vSphere 4.1 release notes:

Persistent reservations conflicts on shared LUNs might cause ESX and ESXi hosts to boot longer

You might experience significant delays while booting your hosts that share LUNs on a SAN. This could be because of conflicts between the LUN SCSI reservations.

Workaround: To resolve this issue and speed up the boot process, change the timeout for synchronous commands during boot time to 10 seconds. You can do this by setting the Scsi.CRTimeoutDuringBoot parameter to 10000.

To modify the parameter from the vSphere Client:

1. In the vSphere Client inventory panel, select the host, click the Configuration tab, and click Advanced Settings under Software.

2. Select SCSI.

3. Change the Scsi.CRTimeoutDuringBoot value to 10000.
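For hosts you reach over the CLI rather than the vSphere Client, the same setting can also be changed with esxcfg-advcfg (from the host's Tech Support Mode shell) or vicfg-advcfg (from the remote vSphere CLI). The option syntax below is from memory, and the host name and credentials are placeholders, so verify against your vCLI build before relying on it:

```shell
# On the host itself (Tech Support Mode / local shell):
esxcfg-advcfg -s 10000 /Scsi/CRTimeoutDuringBoot   # set the boot-time SCSI timeout
esxcfg-advcfg -g /Scsi/CRTimeoutDuringBoot         # read it back to verify

# Or remotely via the vSphere CLI (host name and user are placeholders):
vicfg-advcfg --server esxi-host.example.com --username root \
    -s 10000 Scsi.CRTimeoutDuringBoot
```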

I was directed to this note by VMware support when I opened a case on this issue. This is manifesting for me because I am using MSCS in the virtual environment (cluster across boxes), and for every shared LUN that the claiming process runs into, there's about a 90-second timeout before failing and moving on to the next. I could easily see this in the "messages" log on my host and could tie the errors/timeouts back to my shared LUNs.
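As a quick sanity check (my own arithmetic, not from VMware), a roughly 90-second timeout per shared LUN puts boot times into the tens of minutes once a handful of RDM LUNs are involved:

```python
# Rough model: extra boot delay = shared LUNs x per-LUN reservation timeout.
# The ~90 s per-LUN figure comes from the log observation above; the rest is illustrative.

def added_boot_delay(num_shared_luns: int, per_lun_timeout_s: float = 90.0) -> float:
    """Estimated extra seconds spent timing out on reserved LUNs at boot."""
    return num_shared_luns * per_lun_timeout_s

# Six MSCS RDM LUNs at ~90 s each is about 9 extra minutes of boot time:
print(added_boot_delay(6) / 60)  # -> 9.0
```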

The only way around this issue is to:

A) not use MSCS with the shared controller (or not use MSCS at all, which is the impression I get from VMware), or

B) isolate the LUNs used by your clustered VMs to their own datacenter; that is, mask them from the other hosts so they're not seen at startup.

Otherwise, I've reduced the value noted above to 5000 without any problems thus far.

Hope this helps.

-Todd

roconnor
Enthusiast

Got the same situation, thanks for the pointers. We used to have this issue in 3.5, so we tweaked Advanced Settings -> Scsi -> ScsiRetries from 80 to 25, but VMware has since fixed that with a patch (see http://kb.vmware.com/kb/1009287).

I stumbled on a new KB:  ESX/ESXi 4.x hosts hosting passive MSCS nodes with RDM LUNs may take a long time to boot

http://kb.vmware.com/kb/1016106 - Updated: 7 Feb 2011

To resolve this issue, you must modify an advanced option on ESX and ESXi hosts to speed up the boot process.


For ESX and ESXi 4.0 hosts

The Scsi.UWConflictRetries parameter on ESX/ESXi 4 Update 1 hosts has a default value of 1000, which increases the time spent enumerating LUNs and VMFS volumes.

To resolve this issue and to speed up the boot process, modify this value to 80.

To modify the Scsi.UWConflictRetries parameter from the GUI:

  1. Go to Host > Configuration > Advanced Settings.
  2. In the Advanced Settings window, select SCSI.
  3. Change the Scsi.UWConflictRetries value to 80.


For ESX and ESXi 4.1 hosts

To resolve this issue in ESX/ESXi 4.1 hosts, you must modify the Scsi.CRTimeoutDuringBoot parameter from the GUI.

To modify the Scsi.CRTimeoutDuringBoot parameter:

  1. Go to Host > Configuration > Advanced Settings.
  2. Select SCSI.
  3. Change the Scsi.CRTimeoutDuringBoot value to 1.

I have just tried it on an ESX 4.1 host (HP DL580 G5 attached to NetApp and HDS storage), and path-claiming time at startup went from 30 minutes to 30 seconds.

Hope this helps

Ruud_van_Strijp
Contributor

Thanks for your reply, roconnor. For some reason, though, this workaround did not solve my issue; my ESXi 4.1 host has now been hanging on "cbt loaded successfully. Running usbarbitrator start" for over two hours.

When I format my RAID 5 array with NTFS under Windows Server 2008 R2, the array works just fine, so I don't think the array itself has any issues. My controller is a Promise SuperTrak EX8654.

roconnor
Enthusiast

Hi Ruud,

Sorry but I don't quite understand your setup, you say; "ESXi is installed on a 750GB disk, and I have a Promise SuperTrak 8654 with 4  disks for the VMs. ESXi ... after I  unplugged my SAS cable from my Supertrak card, VMware ESXi 4.1 just started  without any issues. So I assume it has some driver issues with that card."

Where is ESXi installed, on a 750 GB local disk? And where are your datastores, on an external disk array available only to this server?

I hate to say this, but have you checked that all this hardware is on the VMware HCL?

Perhaps I am out of my depth here; we are using ESX, not ESXi, so the error messages are not the same.

roconnor
Enthusiast

One other thing for everyone else connected to a SAN:

Check how the storage admins have configured LUN visibility. We are trying to fix a situation where they grouped all the hosts together in a single 'hostgroup'.

Apparently this is not a good idea, so they have now created a separate hostgroup for each ESX host (we empty the server first, they apply the change, and we reboot).

They should also set the correct parameter identifying the host type (ESX rather than some other kind of server).
