Hi, I have 9 ESX hosts running in three different clusters. Two clusters are HP DL385 G2s and one cluster is HP DL385 G5 hardware. A week or so ago I noticed my test cluster and my G5 hardware cluster had issues where the host would become unreachable from vCenter and the running VMs would drop out - the host would stay pingable, and none of the VMs would come online on another host in the cluster. Connecting to the host console was slow; you could log on, but rebooting with F11 did nothing, it would just sit there.
I thought the issue might be due to the USB sticks provided by HP, as there is an issue with them, so HP replaced them all for me. All test boxes and G5 hosts were running U4, but had only been running for a short period of time. As I was replacing the USB sticks in all hosts apart from one in the production cluster (already replaced a few weeks ago), I thought I would apply Update 4 and the latest patches from Update Manager. Now all hosts running U4 have at some point failed in the same way as above - can't figure out what's going on. I have logged a call with VMware but only have Gold support (Platinum next week I think), so when two hosts failed over the weekend it didn't look good - the latest ESXi U4 patches don't seem to have fixed the issue either... doh!
Has anyone seen this before, or is anyone having the same issue, while I wait on support? It's different hardware, so it must be a bug that's come with Update 4... the one host that is still running U3 hasn't even twitched - so if all else fails I will have to back rev the others to that version.
Cheers
VMware patches are available for download to resolve this
http://kb.vmware.com/kb/1013132
i.e. Patch ESX350-200910409-BG
This patch requires the following patches to be installed also
ESX350-200910401-SG (http://kb.vmware.com/kb/1013124)
ESX350-200910402-BG (http://kb.vmware.com/kb/1013125)
Found another thread with the same issue - that wasn't there when I looked??!!
208112
Have you made any progress on this issue? I have similar issues on ESXi hosts that are HP 585 G5s with ESXi 3.5 U3.
I have a call outstanding with cm support. Indications so far, and from looking at other posts, are that it might be an issue with the HP agents. To solve it, we think the fix is to go to the HP website, download the latest ISO image from there and apply it... Ensure you back up your config first. Then use Update Manager to apply the latest patches. I have done this on my test cluster, but since the failure is random I don't know if it's fixed. I will apply it to production tomorrow, as well as updating the BIOS to the latest rev.
Is anyone seeing the following entries in /var/log/messages and/or syslog output?
UserThread: ###: Peer table full for sfcbd
World: vm #####: ####: WorldInit failed: trying to cleanup.
World: vm #####: ###: init fn user failed with: Out of resources!
We received an indication from VMware that these errors and quite possibly the instability are due to an issue with CIM. VMware provided steps to disable CIM and thus far the errors have not returned. We'll continue to monitor the stability on the BL460c G1's.
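For anyone who wants to check a saved copy of their logs for these three messages, here is a rough Python sketch. The patterns are built only from the lines quoted in this thread, and the function name and the loose matching of the numeric fields are my own assumptions, so treat it as a starting point rather than a definitive detector.

```python
import re
from collections import Counter

# Patterns based on the messages quoted in this thread. The numeric fields
# (thread id, vm id, line number) differ per host, so they are matched loosely.
SIGNATURES = {
    "peer_table_full": re.compile(r"UserThread: \d+: Peer table full for sfcbd"),
    "worldinit_failed": re.compile(r"World: vm \d+: \d+: WorldInit failed: trying to cleanup"),
    "out_of_resources": re.compile(r"World: vm \d+: \d+: init fn user failed with: Out of resources!"),
}

def count_signatures(lines):
    """Count how many lines match each of the three known messages."""
    counts = Counter()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts

# Small demo using lines in the shape posted in this thread.
sample = [
    "UserThread: 406: Peer table full for sfcbd",
    "Apr 26 02:13:36 vmkernel: 0:00:01:18.420 cpu14:1682)WARNING:World: vm 1779: 911: init fn user failed with: Out of resources!",
    "Apr 26 02:13:36 vmkernel: 0:00:01:18.420 cpu14:1682)WARNING: World: vm 1779: 1776: WorldInit failed: trying to cleanup.",
]
print(dict(count_signatures(sample)))
```

To run it over a real file, pass an open file object to `count_signatures`, since it only needs something that yields lines.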
We have seen the same messages. Did you load the HP version of ESXi?
UserThread: 406: Peer table full for sfcbd
Apr 26 02:13:36 vmkernel: 0:00:01:18.420 cpu14:1682)WARNING:World: vm 1779: 911: init fn user failed with: Out of resources!
Apr 26 02:13:36 vmkernel: 0:00:01:18.420 cpu14:1682)WARNING: World: vm 1779: 1776: WorldInit failed: trying to cleanup.
Can you forward the instructions on disabling CIM?
See the 2nd paragraph here on disabling CIM - http://www.vm-help.com/esx/esx3i/disable_CIM_on_startup.php
I'm seeing the same errors in my syslog output. I have now disabled both CIM settings on my hosts - do you know if VMware are working with HP to resolve this issue?
Yes, we installed the latest version of ESX 3i U4 from HP on SAS drives and applied the 10-Apr and 29-Apr patches via VMware Update Manager.
We used the following steps to disable CIM:
1.) On each host, under the configuration tab, select Advanced Settings, select Misc, and set Misc.CIMEnabled to 0.
2.) Put host into maintenance mode
3.) Via unsupported mode (ALT-F1, type unsupported, enter root password)
a.) /etc/init.d/sfcbd-watchdog stop
b.) /etc/init.d/wsmand stop
c.) /etc/init.d/slpd stop
d.) Edit /etc/vmware/hostd/config.xml with vi
e.) Set the tag at path "plugins" -> "cimsvc" -> "enabled" to false
4.) Reboot the host via vCenter
If possible, I'd also open a case with VMware to better track and resolve this issue.
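To illustrate what step 3e changes, here is a small Python sketch that flips the "plugins" -> "cimsvc" -> "enabled" value in a copy of config.xml. The sample XML below is hypothetical: only the path comes from the steps above, and the root element name and the rest of the layout are assumptions. It is meant for a copy of the file on a workstation, not as a replacement for editing the live file with vi.

```python
import xml.etree.ElementTree as ET

def disable_cimsvc(xml_text):
    """Return xml_text with plugins/cimsvc/enabled set to 'false'.

    Assumes 'plugins' is a direct child of the root element, which is a
    guess based on the path given in the steps above.
    """
    root = ET.fromstring(xml_text)
    node = root.find("plugins/cimsvc/enabled")
    if node is None:
        raise ValueError("plugins/cimsvc/enabled not found; file layout differs from the assumption")
    node.text = "false"
    return ET.tostring(root, encoding="unicode")

# Hypothetical, heavily trimmed stand-in for config.xml.
SAMPLE = "<config><plugins><cimsvc><enabled>true</enabled></cimsvc></plugins></config>"
print(disable_cimsvc(SAMPLE))
```

One caveat: ElementTree drops XML comments when it parses and re-serialises a file, so a hand edit of the single value is the less invasive option for the real file.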
Isn't it the Misc.CimOemProvidersEnabled that you need to set to 0 as well, or is this the same as what you have done from the unsupported mode?
Yes, Misc.CimOemProvidersEnabled = 0, in Advanced Settings via the GUI appears to be the same as changing the config.xml file. The Health Status readings should show "Unknown" when CIM is disabled.
Hmm, they don't show disabled, but both settings are set to 0. The errors have stopped appearing in the syslog though... still nothing from VM Support - no contact in 2 days, and I left them a voicemail yesterday. Will have to call them again I think...
same issue here:
was running esx3.5 until i discovered esx3i. decided to migrate to esx3i.
the servers we are using are hp bl460c g1 blades. installed esx3i onto the quickly ordered hp usb flash drives. used the vmware installable and extracted the image because i preferred to run without the hp agents.
running esx3i u4 and the last two patches: ESXe350-200904201-O-SG and ESXe350-200904401-O-SG
all of a sudden, one out of three servers appeared unreachable in vc. i could ping it but not log in using f2 at the console. the backdoor alt-f1 worked. some guests responded to ping but most of them did not. the only way to resolve this was a reset of the server.
opened a support call with vmware. the engineer looked at the available logfiles (i saved them before i reset the server) but was not able to find anything.
today, 2 days later, the same thing happened again!
as i don't use the hp agents, this problem is not related to the agents!
though, i'm using hp usb flash drives! maybe these usb drives have an issue? what's the story behind the replaced hp usb flash drives?
meanwhile i use remote syslog to at least have logfiles available after a restart!
thinking of going back to esx3.5 and dumping the idea of using esx3i embedded..
though, i'm using hp usb flash drives! maybe these usb drives have an issue? what's the story behind the replaced hp usb flash drives?
If you have a green metal HP USB key, they have a tendency to become corrupted.
thanks for the quick reply!
no, my keys are black plastic!
checked my logfiles in the meantime and found the described messages:
2009-05-07 00:10:53 User.Error 10.90.4.152 LSIESG: LSIESG:INTERNAL :: StorelibManager::createDefaultSelfCheckSettings - failed to get TopLevelSystem
2009-05-07 00:10:53 User.Error 10.90.4.152 sfcbd: INTERNAL StorelibManager::createDefaultSelfCheckSettings - failed to get TopLevelSystem
2009-05-07 00:10:53 Local6.Warning 10.90.4.152 vmkernel: 0:03:15:30.941 cpu6:1713)WARNING: UserThread: 406: Peer table full for sfcbd
2009-05-07 00:10:53 Local6.Warning 10.90.4.152 vmkernel: 0:03:15:30.941 cpu6:1713)WARNING: World: vm 49111: 911: init fn user failed with: Out of resources!
2009-05-07 00:10:53 Local6.Warning 10.90.4.152 vmkernel: 0:03:15:30.941 cpu6:1713)WARNING: World: vm 49111: 1776: WorldInit failed: trying to cleanup.
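If you collect logs from several hosts on a remote syslog server, a short Python sketch like the one below can show which hosts logged the "Peer table full for sfcbd" warning and when. The line layout (timestamp, facility.severity, source IP, message) is inferred purely from the sample lines above, and the function names are made up for this example, so adjust the pattern to whatever your collector actually writes.

```python
import re
from collections import defaultdict
from datetime import datetime

# Layout inferred from the sample lines in this thread:
#   <date> <time> <Facility>.<Severity> <source host> <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<facility>\w+)\.(?P<severity>\w+) "
    r"(?P<host>\S+) "
    r"(?P<message>.*)$"
)

def parse_line(line):
    """Split one collector line into its fields, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["ts"] = datetime.strptime(rec["ts"], "%Y-%m-%d %H:%M:%S")
    return rec

def peer_table_full_by_host(lines):
    """Map each source host to the timestamps of its 'Peer table full' warnings."""
    hits = defaultdict(list)
    for line in lines:
        rec = parse_line(line)
        if rec and "Peer table full for sfcbd" in rec["message"]:
            hits[rec["host"]].append(rec["ts"])
    return dict(hits)

sample = [
    "2009-05-07 00:10:53 User.Error 10.90.4.152 sfcbd: INTERNAL StorelibManager::createDefaultSelfCheckSettings - failed to get TopLevelSystem",
    "2009-05-07 00:10:53 Local6.Warning 10.90.4.152 vmkernel: 0:03:15:30.941 cpu6:1713)WARNING: UserThread: 406: Peer table full for sfcbd",
]
for host, stamps in peer_table_full_by_host(sample).items():
    print(host, len(stamps), "warning(s), first at", stamps[0])
```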
Additionally, black USB flash drives without "SMSE" printed on them are vulnerable. Details: http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c01605187
fgw: Please follow the steps I provided earlier in this thread to eliminate the error messages at hand. There's a difference between CIM and the HP SIM agents (which you stripped out).
I had also heard bad things about the USB keys, but I did not use them and I still have the issue. I used ESX 3.5i U3, the HP version.
donnieq,
this document unfortunately is not available, or at least the link is not valid anymore ...
will try to disable CIM and see what happens ...
tschmidt,
so are you running your server from hard disk, or are you using a different type of usb flash drive?
I am waiting for VMware engineers to contact me about using my test system to reproduce the error. I used to use 3.5 but moved to the i version for simplified configuration and fewer security issues. This is the first issue I have had since moving to it, so I don't really want to move back... Plus, who's to say you won't get issues like this in the future.
HP is no longer selling the USB keys; they now provide a list of certified ones to buy.