Mikeluff
Contributor

ESXi 3.5 Host Hangs Since U4?

Hi, I have 9 ESX hosts running in three different clusters. Two clusters are HP DL385 G2s and one cluster is HP DL385 G5 hardware. A week or so ago I noticed that my test cluster and my G5 hardware cluster had issues where a host would become unreachable from vCenter and the VMs running on it would drop out. The host would stay pingable, and none of the VMs would come online on another host in the cluster. Connecting to the host console was slow; you could log on, but rebooting with F11 did nothing, it would just sit there.

I thought the issue might be due to the USB sticks provided by HP, as there is an issue with them, so HP replaced them all for me. All test boxes and G5 hosts were running U4, but had only been running for a short period of time. As I was replacing the USB sticks in all hosts apart from one in the production cluster (already replaced a few weeks ago), I thought I would apply Update 4 and the latest patches from Update Manager. Now all hosts running U4 have at some point failed in the same way as above, and I can't figure out what's going on. I have logged a call with VMware, but we only have Gold support (Platinum next week, I think), so when two hosts failed over the weekend it didn't look good. The latest ESXi U4 patches don't seem to have fixed the issue either... doh!

Has anyone seen this before, or is anyone having the same issue, while I wait on support? The hardware is different, so it must be a bug that came with Update 4. The one host that is still running U3 hasn't even twitched, so if all else fails I will have to roll back to that version.

Cheers

100 Replies
jmcdonald1
VMware Employee

Hey Mike,

From my understanding, QA is working on testing the patch. It is not exactly a short process, as we are talking about a storage driver. I hope to have more information soon.

Cheers,

/Jonathan

Mikeluff
Contributor

I am currently testing the patch for VMware on my test system. Needless to say, I doubt it will be long until the production version is released.

Mike

spex
Expert

Here is the public KB just released yesterday.

http://kb.vmware.com/kb/1012575

Mike625
Contributor

The solution in the KB did not fix the instability that we have seen, and the value of Misc.CimOemProvidersEnabled was already 0. The instability didn't go away until we followed instructions sent to us by tech support to shut down three processes. We now have no hardware monitoring until this is fixed.

Instructions given if the items in the kb don't mitigate the issue:

disable CIM from running:

  1. /etc/init.d/sfcbd-watchdog stop

  2. /etc/init.d/wsmand stop

  3. /etc/init.d/slpd stop

This will stop sfcbd, wsmand and slpd. One thing to note is that if any of the hosts are rebooted, you will need to stop the services again in the same way.
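
The three stop commands above can be wrapped in a small shell loop so they are easy to re-run after a reboot. This is only a sketch: the function name `stop_cim_services` and the `INIT_DIR` and `DRY_RUN` variables are made up for illustration and are not part of ESXi, and it assumes the tech support mode shell accepts ordinary POSIX `sh` syntax. It defaults to a dry run that only prints what it would do.

```shell
#!/bin/sh
# Sketch: stop the three CIM-related services named in the post above.
# INIT_DIR and DRY_RUN are hypothetical knobs added for this example.
INIT_DIR="${INIT_DIR:-/etc/init.d}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = really run them

stop_cim_services() {
    # Same order as the numbered steps: sfcbd-watchdog, wsmand, slpd.
    for svc in sfcbd-watchdog wsmand slpd; do
        script="$INIT_DIR/$svc"
        if [ "$DRY_RUN" = "1" ]; then
            echo "would run: $script stop"
        elif [ -x "$script" ]; then
            "$script" stop
        else
            # Do not guess at alternatives; just report the missing script.
            echo "skipping $svc: $script not found" >&2
        fi
    done
}

stop_cim_services
```

Setting `DRY_RUN=0` before running it would issue the real stop commands. As noted above, the effect does not survive a reboot, so it would have to be run again afterwards.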

jmcdonald1
VMware Employee

Hey Mike,

The Misc.CimOemProvidersEnabled parameter does not completely disable sfcbd; it only disables the link between our hardware-monitoring code and the third party's. Thus, the problem would likely still exist because CIM is still started. In the article we say to set Misc.CimEnabled to 0, which will completely disable CIM. After the server reboots, you will see that no sfcbd processes are started. (I tested this fully before I wrote the KB... ;) )

Your instructions are correct though; they would in effect accomplish the same thing as following the KB. We just need a more user-friendly way to accomplish the task. :)

Cheers,

/Jonathan

Mikeluff
Contributor

FYI - I'm still testing the patch for this, and I'm getting a new version next week.

fgw
Contributor

Jonathan,

I can confirm what you said!

Setting Misc.CimEnabled to 0 AND rebooting the server will fix this! I have done this on my servers since this problem popped up and VMware support suggested it as a workaround. The issue has not come up again since then, which is about two months ago now!

Also, as it might not be convenient to reboot your server, you can set Misc.CimEnabled to 0 AND stop the process manually with the command:

/etc/init.d/sfcbd-watchdog stop

Jonathan, you might add this to your KB ...

Mike625
Contributor

Thanks Jonathan, my mistake. This is a much friendlier workaround.

jmcdonald1
VMware Employee

I originally had both sets of steps in the KB; however, it is against our documentation policy to publish steps that someone would run in tech support mode. Officially, as a policy, tech support mode is only supposed to be used when requested by a tech support representative while working on a case.

jmcdonald1
VMware Employee

Excellent. Let us know if you see any unexpected behavior in the meantime. The last I heard, earlier this week, our internal QA tests have also been successful so far. To be fully sure, we need to let the environment sit for a few days with the tests running.

HMC-Frank
Contributor

I am running ESXi 3.5 U3 and U4 on five Dell 2950 servers with recent BIOS and BMC. I have seen the disconnect happen on three out of five servers, although the VMs stay running, with syslog reporting WorldInit failed errors among others. Even though I have a case open with Dell Enterprise support on the issue, with log reviews pending, I am so happy to find this discussion and the new VMware KB article. I am applying the solution to all five servers now and will post if the issue occurs afterwards. No news is good news.

stefanjansson
Contributor

I'm on vacation and will be back on August 17.

Regards

// Stefan

Mikeluff
Contributor

The patch is due for release at the end of October...

jparnell
Hot Shot

Do other people still get a health status in the VIC even after disabling Misc.CimEnabled and Misc.CimOemProvidersEnabled? (I've rebooted.) I've applied all the latest patches, and I'm just concerned that the ESXi server may crash if CIM hasn't been disabled properly. On other hosts the hardware health shows as 'unknown'.

Mikeluff
Contributor

I've had the same issue in the past - if you look back in this thread there is also an xml file you can edit.

Although mine are all currently disabled just using the setting within the client.

jmcdonald1
VMware Employee

If I remember correctly, the health status item may still show; it just doesn't update the information. I think that if you click update you will also see an error, since it cannot connect. You can verify whether it is truly disabled by logging into tech support mode and running: ps |grep sfcbd. If you do not see any results from that, it is not running.

Cheers,

/Jonathan
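
The `ps | grep sfcbd` check described above could be wrapped into a yes/no test, sketched below. The function name `cim_running` is hypothetical, and the sketch assumes a POSIX-style `sh`. The `[s]fcbd` pattern is a common trick to stop grep from matching its own entry in the process list.

```shell
#!/bin/sh
# Sketch: report whether any sfcbd process shows up in a process listing.
# cim_running is a hypothetical helper; it reads `ps` output on stdin.
cim_running() {
    # [s]fcbd matches the text "sfcbd" but not grep's own command line,
    # which contains the literal characters "[s]fcbd".
    grep '[s]fcbd' >/dev/null
}

if ps | cim_running; then
    echo "sfcbd is still running, so CIM is not fully disabled"
else
    echo "no sfcbd processes found"
fi
```

Reading the listing from stdin keeps the helper easy to try against saved `ps` output as well as a live host.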

paudieo
VMware Employee

VMware patches are available for download to resolve this:

http://kb.vmware.com/kb/1013132

i.e. Patch ESX350-200910409-BG

This patch also requires the following patches to be installed:

ESX350-200910401-SG (http://kb.vmware.com/kb/1013124)

ESX350-200910402-BG (http://kb.vmware.com/kb/1013125)

fgw
Contributor

Looks like this patch is for ESX 3.5 and not ESXi 3.5?

Although there is a new firmware image available for ESXi 3.5 that includes these fixes (I checked this some days ago), I can't see any hint of a fix for the problem described here.

By the way, I can't access patch downloads anymore. As soon as I click search I get this:

Authentication Error
You are not allowed to access this page.
If you feel this is in error, please try again.
If the error persists, Please contact VMware Technical Support.

Anybody else?

EDIT: access to the patch database is working again ...

jmcdonald1
VMware Employee

This appears to be a problem with the actual download page, which I will report. The patch for ESXi is:

ESXe350-200910401-I-SG

Cheers,

/Jonathan

fgw
Contributor

Jonathan,

The download page is working again, thanks.

Are you saying ESXe350-200910401-I-SG will fix the problem discussed in this thread? Looking at the summaries section of ESXe350-200910401-I-SG, I can't find any word on this. Also, do you have the PR number under which this bug is filed? Maybe it's fixed but simply not documented clearly ...

Summaries

This patch contains the following:

  • ESXi 3.5 Update 4 hosts with Emulex HBAs might stop responding when accessed through vCenter Server.

  • This patch reduces the boot time of ESX hosts and should be
    applied when multiple ESX hosts detect LUNs used for Microsoft Cluster
    Service (MSCS).

  • After applying this patch, any request for connection with
    ESXi 3.5 using cipher suite of 56-bit encryption will be dropped. This
    also includes any request for connection to CIM port 5989 on ESXi 3.5.
    As a result, browsers that exclusively use cipher suites with 40-bit
    and 56-bit encryption cannot connect to ESXi 3.5. Microsoft has made
    the Internet Explorer High Encryption Pack available for Internet
    Explorer 5.01 and earlier. Internet Explorer 5.5 and later versions
    already use 128-bit encryption.

  • ESXi host fails if its available TCP/IP sockets are exhausted and an NFS Client has a directory mounted.
