VMware Cloud Community
Mikeluff
Contributor

ESXi 3.5 Host Hangs Since U4?

Hi, I have 9 ESX hosts running in three different clusters. Two clusters are HP DL385 G2s and one cluster is HP DL385 G5 hardware. A week or so ago I noticed my test cluster and my G5 hardware cluster had issues where the host would become unreachable from vCenter and the VMs running on it would drop out. The host would stay pingable and none of the VMs would come online on another host in the cluster. Connecting to the host console was slow; you could log on but not reboot using F11, it would just sit there.

Thought the issue might be due to the USB sticks provided by HP, as there is an issue with them, so HP replaced them all for me. All test boxes and G5 hosts were running U4, but had only been running for a short period of time. As I was replacing the USB sticks in all hosts apart from one in the production cluster (already replaced a few weeks ago), I thought I would apply Update 4 and the latest patches from Update Manager. Now all hosts running U4 have at some point failed in the same way as above, and I can't figure out what's going on. I have logged a call with VMware but only have Gold support (Platinum next week I think), so when two hosts failed over the weekend it doesn't look good. The latest ESXi U4 patches don't seem to have fixed the issue either... doh!

Anyone seen this before, or having the same issue while I wait on support? It's different hardware, so it must be a bug that's come with Update 4... the one host that is still running U3 hasn't even twitched, so if all else fails I will have to back-rev to that version.

Cheers


Accepted Solutions
paudieo
VMware Employee

VMware patches are available for download to resolve this:

http://kb.vmware.com/kb/1013132

i.e. Patch ESX350-200910409-BG

This patch also requires the following patches to be installed:

ESX350-200910401-SG (http://kb.vmware.com/kb/1013124)

ESX350-200910402-BG (http://kb.vmware.com/kb/1013125)

100 Replies
Mikeluff
Contributor

Found another thread with the same issue that wasn't there when I looked??!!

208112

tschmidt
Contributor

Have you made any progress on this issue? I have similar issues on esxi hosts that are hp585 g5's with esxi 3.5 u3.

Mikeluff
Contributor

I have a call outstanding with VMware support. Indications so far, and from looking at other posts, are that it might be an issue with the HP agents. To solve it, we think you need to go to the HP website, download the latest ISO image from there and apply it... Make sure you back up your config first. Then use Update Manager to apply the latest patches. I have done this on my test cluster, but since the failure is random I don't know if it's fixed. I will apply it to production tomorrow as well as updating the BIOS to the latest rev.

donnieq
Enthusiast

Is anyone seeing the following entries in /var/log/messages and/or syslog output?

UserThread: ###: Peer table full for sfcbd

World: vm #####: ####: WorldInit failed: trying to cleanup.

World: vm #####: ###: init fn user failed with: Out of resources!

We received an indication from VMware that these errors and quite possibly the instability are due to an issue with CIM. VMware provided steps to disable CIM and thus far the errors have not returned. We'll continue to monitor the stability on the BL460c G1's.
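For anyone wanting to check saved logs for these three messages, here is a minimal Python sketch. It assumes you have copied /var/log/messages (or your remote syslog file) to a workstation where Python is available; the function name `count_signatures` and the signature labels are made up for illustration, and the patterns are based only on the lines quoted in this thread.

```python
import re

# Patterns built from the messages quoted in this thread.
# The numeric fields (shown as ### above) vary, so they are matched loosely.
SIGNATURES = {
    "peer_table_full": re.compile(r"UserThread: \d+: Peer table full for sfcbd"),
    "worldinit_failed": re.compile(r"World: vm \d+: \d+: WorldInit failed"),
    "out_of_resources": re.compile(
        r"World: vm \d+: \d+: init fn user failed with: Out of resources!"
    ),
}


def count_signatures(lines):
    """Count how many lines match each of the three signatures."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Small inline sample; in practice pass an open file object instead.
sample_lines = [
    "vmkernel: 0:00:01:18.420 cpu14:1682)WARNING: UserThread: 406: Peer table full for sfcbd",
    "vmkernel: 0:00:01:18.420 cpu14:1682)WARNING: World: vm 1779: 1776: WorldInit failed: trying to cleanup.",
    "some unrelated log line",
]
for name, hits in count_signatures(sample_lines).items():
    print(name, hits)
```

Non-zero counts only tell you the messages are present; they do not by themselves prove CIM is the cause of a hang.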

tschmidt
Contributor

We have seen the same messages. Did you load the HP version of ESXi?

UserThread: 406: Peer table full for sfcbd

Apr 26 02:13:36 vmkernel: 0:00:01:18.420 cpu14:1682)WARNING:World: vm 1779: 911: init fn user failed with: Out of resources!

Apr 26 02:13:36 vmkernel: 0:00:01:18.420 cpu14:1682)WARNING: World: vm 1779: 1776: WorldInit failed: trying to cleanup.

Can you forward the instructions on disabling CIM?

Dave_Mishchenko
Immortal

See the 2nd paragraph here on disabling CIM - http://www.vm-help.com/esx/esx3i/disable_CIM_on_startup.php

Mikeluff
Contributor

I'm seeing the same errors in my syslog output. I have now disabled both CIM settings on my hosts. Do you know if VMware are working with HP to resolve this issue?

donnieq
Enthusiast

Yes, we installed the latest version of ESX 3i U4 from HP on SAS drives and applied the 10-Apr and 29-Apr patches via VMware Update Manager.

We used the following steps to disable CIM:

1.) On each host, under the configuration tab, select Advanced Settings, select Misc, and set Misc.CIMEnabled to 0.

2.) Put host into maintenance mode

3.) Enter unsupported mode (ALT-F1, type unsupported, enter root password) and do the following:

a.) /etc/init.d/sfcbd-watchdog stop

b.) /etc/init.d/wsmand stop

c.) /etc/init.d/slpd stop

d.) Edit /etc/vmware/hostd/config.xml with vi

e.) Set the tag at path "plugins" -> "cimsvc" -> "enabled" to false

4.) Reboot the host via vCenter

If possible, I'd also open a case with VMware to better track and resolve this issue.
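To make step 3e concrete, here is a rough Python sketch of the same change applied to a copy of the file on a workstation (not on the host itself). The sample XML is a hypothetical, heavily trimmed stand-in: the root element name `config` and the assumption that `plugins` sits directly under the root are guesses for illustration, so check them against your real config.xml. The function name `disable_cimsvc` is also made up.

```python
import xml.etree.ElementTree as ET


def disable_cimsvc(xml_text):
    """Return xml_text with the plugins/cimsvc/enabled element set to 'false'."""
    root = ET.fromstring(xml_text)
    # Path is relative to the root element; assumes <plugins> is its direct child.
    enabled = root.find("./plugins/cimsvc/enabled")
    if enabled is None:
        raise ValueError("plugins/cimsvc/enabled not found")
    enabled.text = "false"
    return ET.tostring(root, encoding="unicode")


# Hypothetical stand-in for /etc/vmware/hostd/config.xml (real file is much larger).
SAMPLE = """<config>
  <plugins>
    <cimsvc>
      <enabled>true</enabled>
    </cimsvc>
  </plugins>
</config>"""

print(disable_cimsvc(SAMPLE))
```

Note that ElementTree drops XML comments when parsing, so this is only meant to show which element changes. Editing the real file by hand with vi, as in the steps above, keeps the rest of the file untouched.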

Mikeluff
Contributor

Isn't it Misc.CimOemProvidersEnabled that you need to set to 0 as well, or is this the same as what you have done from unsupported mode?

donnieq
Enthusiast

Yes, Misc.CimOemProvidersEnabled = 0, in Advanced Settings via the GUI appears to be the same as changing the config.xml file. The Health Status readings should show "Unknown" when CIM is disabled.

Mikeluff
Contributor

Hmm, they don't show as disabled, but both settings are set to 0. The errors have stopped appearing in the syslog though... still nothing from VMware support: no contact in 2 days, and I left them a voicemail yesterday. Will have to call them again I think...

fgw
Enthusiast

same issue here:

was running esx3.5 until i discovered esx3i. decided to migrate to esx3i.

the servers we are using are hp bl460c g1 blades. installed esx3i onto the quickly ordered hp usb flash drives. used the vmware installable and extracted the image because i preferred to run without the hp agents.

running esx3i u4 and the last two patches: ESXe350-200904201-O-SG and ESXe350-200904401-O-SG

all of a sudden, one out of three servers appeared unreachable in vc. i could ping it but not log in using the console f2. the backdoor alt-f1 worked. some guests responded to ping but most of them did not. the only way to resolve this was a reset of the server.

opened up a support call at vmware. the engineer looked at the available logfiles (i saved them before i reset the server) but was not able to find anything.

today, 2 days later, the same happened again!

as i don't use the hp agents this problem is not related to the agents!

though, i'm using hp usb flash drives! maybe these usb drives have an issue? what's the story behind the replaced hp usb flash drives?

meanwhile i use remote syslog to at least have logfiles available after a restart!

thinking of going back to esx3.5 and dump the idea of using esx3i embedded..

Dave_Mishchenko
Immortal

"though, i'm using hp usb flash drives! maybe these usb drives have an issue? what's the story behind the replaced hp usb flash drives?"

If you have a green metal HP USB key, then they have a tendency to become corrupt.

fgw
Enthusiast

thanks for the quick reply!

no, my keys are black plastic!

checked my logfiles in the meantime and found the described messages:

2009-05-07 00:10:53 User.Error 10.90.4.152 LSIESG: LSIESG:INTERNAL :: StorelibManager::createDefaultSelfCheckSettings - failed to get TopLevelSystem

2009-05-07 00:10:53 User.Error 10.90.4.152 sfcbd: INTERNAL StorelibManager::createDefaultSelfCheckSettings - failed to get TopLevelSystem

2009-05-07 00:10:53 Local6.Warning 10.90.4.152 vmkernel: 0:03:15:30.941 cpu6:1713)WARNING: UserThread: 406: Peer table full for sfcbd

2009-05-07 00:10:53 Local6.Warning 10.90.4.152 vmkernel: 0:03:15:30.941 cpu6:1713)WARNING: World: vm 49111: 911: init fn user failed with: Out of resources!

2009-05-07 00:10:53 Local6.Warning 10.90.4.152 vmkernel: 0:03:15:30.941 cpu6:1713)WARNING: World: vm 49111: 1776: WorldInit failed: trying to cleanup.
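If you are collecting remote syslog from several hosts, a small Python sketch like the following can split lines of this shape into fields and tally vmkernel warnings per host. The field layout (timestamp, Facility.Severity, source IP, tag, message) is inferred only from the lines pasted above and will depend on your syslog server; the names `parse_line` and `warnings_per_host` are invented for this example.

```python
import re
from collections import Counter
from datetime import datetime

# Layout inferred from the pasted lines:
#   <date> <time> <Facility>.<Severity> <source ip> <tag>: <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<facility>\w+)\.(?P<severity>\w+) "
    r"(?P<host>\S+) (?P<tag>[^:\s]+): (?P<msg>.*)$"
)


def parse_line(line):
    """Split one remote-syslog line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["ts"] = datetime.strptime(record["ts"], "%Y-%m-%d %H:%M:%S")
    return record


def warnings_per_host(lines):
    """Count vmkernel lines logged at Warning severity, keyed by source host."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record and record["tag"] == "vmkernel" and record["severity"] == "Warning":
            counts[record["host"]] += 1
    return counts


# Two of the lines pasted above, used as sample input.
SAMPLE = [
    "2009-05-07 00:10:53 User.Error 10.90.4.152 sfcbd: INTERNAL StorelibManager::createDefaultSelfCheckSettings - failed to get TopLevelSystem",
    "2009-05-07 00:10:53 Local6.Warning 10.90.4.152 vmkernel: 0:03:15:30.941 cpu6:1713)WARNING: UserThread: 406: Peer table full for sfcbd",
]
print(warnings_per_host(SAMPLE))
```

With more than one host logging to the same server, this makes it easier to see whether the warnings cluster on particular machines before a hang.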

donnieq
Enthusiast

Additionally, black USB flash drives without "SMSE" printed on them are vulnerable. Details: http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c01605187

donnieq
Enthusiast

fgw: Please follow the steps I provided earlier in this thread to eliminate the error messages at hand. There's a difference between CIM and the HP SIM agents (which you stripped out).

tschmidt
Contributor

I had also heard bad things about the USB keys, but I did not use them and I still have the issue. I used ESX 3.5i U3, the HP version.

fgw
Enthusiast

donnieq,

this document unfortunately is not available, or at least the link is not valid anymore ...

will try to disable CIM and see what happens ...

tschmidt,

so you are running your server from hard disk, or are you using a different type of usb flash drive?

Mikeluff
Contributor

I am waiting for VMware engineers to contact me to use my test system to reproduce the error. I used to use 3.5 but moved to the i version for simplified configuration and fewer security issues. This is the first issue I have had since moving to it, so I don't really want to move back... Plus who's to say you won't get issues like this in the future.

HP are no longer selling the USB keys; they now provide a list of certified ones to buy.
