Hi, I have 9 ESX hosts running in three different clusters. Two clusters are HP DL385 G2s and one cluster is HP DL385 G5 hardware. A week or so ago I noticed my test cluster and my G5 hardware cluster had issues where the host would become unreachable from vCenter and the running VMs would drop out. The host would stay pingable, but none of the VMs would come online on any other host in the cluster. The host console was slow to respond; you could log on, but a reboot via F11 would just sit there.
Thought the issue might be due to the USB sticks provided by HP, as there is a known issue with them, so HP replaced them all for me. All test boxes and G5 hosts were running U4, but had only been running for a short period of time. As I was replacing the USB sticks in all hosts apart from one in the production cluster (already replaced a few weeks ago), I thought I would apply Update 4 and the latest patches from Update Manager. Now all hosts running U4 have at some point failed in the same way as above - I can't figure out what's going on. I have logged a call with VMware, but we only have Gold support (Platinum next week, I think), so when two hosts failed over the weekend it didn't look good - the latest ESXi U4 patches don't seem to have fixed the issue either... doh!
Has anyone seen this before, or is anyone having the same issue while I wait on support? It's happening on different hardware, so it must be a bug that came with Update 4... the one host that is still running U3 hasn't even twitched - so if all else fails I will have to roll back to that version.
From my understanding, QA is working on testing the patch. It is not exactly a short process, as we are talking about a storage driver. I hope to have more information soon.
The solution in the KB did not fix the instability that we have seen, and the value of Misc.CimOemProvidersEnabled was already 0. The instability didn't go away until we followed instructions sent to us by tech support to shut down three processes. We now have no hardware monitoring until this is fixed.
Instructions given if the items in the KB don't mitigate the issue:
Disable CIM from running.
This will stop sfcbd, wsmand and slpd. One thing to note: if any of the hosts are rebooted, you will need to stop the processes in the same way.
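The exact stop commands weren't posted here, but a minimal sketch of stopping those three daemons from tech support mode might look like the following. The stop_cim_daemons helper name is mine, not VMware's, and I'm assuming only the plain busybox ps/grep/kill utilities that the ESXi 3.5 console shell provides; treat this as an illustration, not an official procedure:

```shell
# Sketch: stop the CIM-related daemons (sfcbd, wsmand, slpd) by PID.
# stop_cim_daemons is a hypothetical helper name; it relies only on
# the basic ps/grep/awk/kill utilities available in tech support mode.
stop_cim_daemons() {
  for svc in sfcbd wsmand slpd; do
    # list PIDs whose command line mentions the daemon, excluding our own grep
    pids=$(ps | grep "$svc" | grep -v grep | awk '{print $1}')
    if [ -n "$pids" ]; then
      kill $pids && echo "stopped $svc"
    else
      echo "$svc not running"
    fi
  done
}

stop_cim_daemons
```

As noted above, these daemons come back after a reboot, so the kill has to be repeated (or Misc.CimEnabled set to 0) each time the host boots.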
The Misc.CimOemProvidersEnabled parameter does not completely disable sfcbd; it only disables the link between our hardware-monitoring code and the third-party providers. Thus, the problem would likely still exist because CIM is still started. In the article we say to set Misc.CimEnabled to 0, which will completely disable CIM. After rebooting the server, you will see that no sfcbd processes are started. (I tested this fully before I wrote the KB... ;) )
Your instructions are correct, though; this would in effect accomplish the same thing as following the KB. We just need a more user-friendly way to accomplish the task.
I can confirm what you said!
Setting Misc.CimEnabled to 0 AND rebooting the server will fix this! I have done this on my servers from the time this problem popped up, after VMware support suggested it as a workaround. The issue has not come up again since then, which is about two months ago now!
Also, as it might not be convenient to reboot your server, you can also set Misc.CimEnabled to 0 AND stop the processes manually with the command:
Jonathan, you might add this to your KB ...
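For anyone who prefers the command line over digging through the VI Client's advanced settings, a sketch of flipping the option from the console shell might look like this. I'm assuming the esxcfg-advcfg syntax and the /Misc/CimEnabled option path here; verify both against your own build before relying on them:

```shell
# Sketch (assumed syntax): disable CIM via the advanced config option.
# /Misc/CimEnabled corresponds to the Misc.CimEnabled setting in the VI Client.
esxcfg-advcfg -s 0 /Misc/CimEnabled

# Read the value back to confirm it took (assumed -g "get" flag).
esxcfg-advcfg -g /Misc/CimEnabled

# A reboot (or manually stopping sfcbd/wsmand/slpd) is still required
# for the change to take effect on a running host.
```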
I originally had both sets of steps in the KB; however, it is against our documentation policy to publish steps that someone would run in tech support mode. Officially, as a policy, tech support mode is only supposed to be used as directed by a tech support representative while working on a case.
Excellent. Let us know if you see any unexpected behavior in the meantime. The last I heard earlier this week, our internal QA tests have also been successful so far. To be fully sure, we need to let the environment sit for a few days with the tests running.
I am running ESXi 3.5 U3 and U4 on five Dell 2950 servers with recent BIOS and BMC firmware. I have seen the disconnect happen on three of the five servers; although the VMs stay running, syslog reports WorldInit failed errors, among others. I also have a case open with Dell Enterprise support on this issue, with log reviews pending, so I'm happy to find this discussion and the new VMware KB article. I'm applying the solution to all five servers now and will post if the issue occurs again. No news is good news.
Do other people still get a health status in the VIC even after disabling Misc.CimEnabled and Misc.CimOemProvidersEnabled? (I've rebooted.) I've applied all the latest patches, and I'm just concerned that the ESXi server may crash if CIM hasn't been disabled properly. On other hosts the hardware health shows as 'unknown'.
I've had the same issue in the past - if you look back in this thread there is also an XML file you can edit.
Although mine are all currently disabled just using the setting within the client.
If I remember correctly, the health status item may still show; it just doesn't update the information. I think that if you click update you will also see an error, since it cannot connect. You can verify whether CIM is truly disabled by logging into tech support mode and running: ps | grep sfcbd. If that returns no results, sfcbd is not running.
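That ps check can be wrapped into a small one-shot test, e.g. (check_sfcbd is just an illustrative name of my own; the underlying ps | grep sfcbd is the same check described above):

```shell
# Sketch: report whether sfcbd is still running on this host.
check_sfcbd() {
  # grep -v grep excludes the grep process itself from the match
  if ps | grep sfcbd | grep -v grep > /dev/null; then
    echo "sfcbd is still running - CIM is not fully disabled"
  else
    echo "sfcbd is not running - CIM appears disabled"
  fi
}

check_sfcbd
```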
VMware patches are available for download to resolve this,
i.e. Patch ESX350-200910409-BG.
This patch requires the following patches to be installed as well:
Looks like this patch is for ESX 3.5 and not ESXi 3.5?
Although there is a new firmware image for ESXi 3.5 available that includes fixes (I checked this some days ago), I can't see any hint of a fix for the problem described here.
By the way, I can't access the patch downloads anymore. As soon as I click search I get this:
You are not allowed to access this page.
If you feel this is in error, please try again.
If the error persists, Please contact VMware Technical Support.
EDIT: access to the patch database is working again ...
download page is working again, thanks.
Are you saying ESXe350-200910401-I-SG will fix the problem discussed in this thread? Looking at the summaries section of ESXe350-200910401-I-SG, I can't find any word on this. Also, do you have the PR number under which this bug is filed? Maybe it's fixed but simply not documented clearly ...
This patch contains the following:
- ESXi 3.5 Update 4 hosts with Emulex HBAs might stop responding when accessed through vCenter Server.
- This patch reduces the boot time of ESX hosts and should be applied when multiple ESX hosts detect LUNs used for Microsoft Cluster
- After applying this patch, any request for connection with ESXi 3.5 using a cipher suite of 56-bit encryption will be dropped. This also includes any request for connection to CIM port 5989 on ESXi 3.5. As a result, browsers that exclusively use cipher suites with 40-bit and 56-bit encryption cannot connect to ESXi 3.5. Microsoft has made the Internet Explorer High Encryption Pack available for Internet Explorer 5.01 and earlier. Internet Explorer 5.5 and later versions already use 128-bit encryption.
- ESXi host fails if its available TCP/IP sockets are exhausted and an NFS client has a directory mounted.