Mikeluff
Contributor

ESXi 3.5 Host Hangs Since U4?

Hi, I have 9 ESX hosts running in three different clusters. Two clusters are HP DL385 G2s and one cluster is HP DL385 G5 hardware. A week or so ago I noticed my test cluster and my G5 cluster had issues where a host would become unreachable from vCenter and its running VMs would drop out - the host would stay pingable, but none of the VMs would come online on another host in the cluster. Connecting to the host console was slow; you could log on, but a reboot via F11 would just sit there.

Thought the issue might be due to the USB sticks provided by HP, as there is a known issue with them, so HP replaced them all for me. All test boxes and G5 hosts were running U4, but had only been running for a short period of time. As I was replacing the USB sticks in all hosts apart from one in the production cluster (already replaced a few weeks ago), I thought I would apply Update 4 and the latest patches from Update Manager. Now all hosts running U4 have at some point failed in the same way as above - can't figure out what's going on. Have logged a call with VMware, but I only have Gold support (Platinum next week, I think), so when two hosts failed over the weekend it didn't look good - the latest ESXi U4 patches don't seem to have fixed the issue either... doh!

Anyone seen this before, or having the same issue, while I wait on support? It's happening on different hardware, so it must be a bug that's come with Update 4... the one host that is still running U3 hasn't even twitched - so if all else fails I will have to rev back to that version.

Cheers

100 Replies
jparnell
Hot Shot

Just wondering if anyone has any news on this?

James

fgw
Contributor

Been busy with some other problems here, but I got a response from VMware support today:

There is now a bug open against this issue and engineering are investigating this.

I'll get back when I have more detailed information...

Mikeluff
Contributor

Since I upgraded my BIOS and disabled the CIM, things seem stable - fingers crossed... I have not had an update on my support case. Please keep us in the loop.

here4now
Contributor

OK... just to add more to the issue. I have a pair (well, seven) of HP DL580 G5s and a few Dells.

We had a host crash, and of course it froze all the guests. On a reboot of the host, all came back to life.

But.

The HP that crashed is running U4 with the HP SIM bundle from VMware installed.

So after the crash I started to look more into the errors, and it seems this might just be noise from the machine. I also have the same hardware without the HP SIM installed, and I see the same errors in syslog.

The same goes for a Dell 1950 with the VMware Dell OpenManage plugins installed... again I see pretty much the same errors on my syslog server.

I don't think the issues are with the vendor tools; if anything it's an issue in U4. I believe my crash was related to one guest set to unlimited for CPU and memory.

Anyone seen something like that?

donnieq
Enthusiast

here4now: You are seeing the same syslog messages on each system because the issue found and being tracked in this thread is with CIM (Common Information Model). This feature is built into VMware ESX/ESXi and is enabled by default. It provides information about the server's hardware status under the Configuration tab / Health Status in vCenter.

Look back at my earlier posts and you will find the steps to disable CIM on your servers, regardless of whether they have HP SIM (System Insight Manager) agents installed.

Don Q.

here4now
Contributor

Thanks donnieq

If I shut CIM down, will the machine still report issues to my syslog server?

Mikeluff
Contributor

CIM should have nothing to do with the syslog server, so you should be OK there.

fgw
Contributor

Don is right here!

It looks like this issue is related to CIM. VMware support is currently analyzing this; they found that the sfcbd process is eating up memory.
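If you want to see this on your own hosts before disabling anything, here is a rough sketch from the unsupported Tech Support Mode console. The busybox ps column layout below is an assumption on my part - check it with a plain `ps` on your build first:

```shell
# Log sfcbd's virtual memory size every 5 minutes so you can see
# whether it keeps growing. Assumes busybox ps prints columns as:
# PID USER VSZ STAT COMMAND (verify before trusting the numbers).
while true; do
    ps | grep '[s]fcbd' | while read pid user vsz rest; do
        echo "$(date) sfcbd pid=$pid vsz=${vsz}kB"
    done
    sleep 300
done
```

Redirect the output to a file on a datastore if you want to keep it across console sessions.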

Until VMware comes out with a fix for this, disabling CIM should be a valid workaround.

This can be done in two ways:

Without rebooting the server:

  1. Run /etc/init.d/sfcbd-watchdog stop from the console.

  2. Disable CIM in the VI Client or VirtualCenter: Configuration -> Advanced Settings -> Misc -> set Misc.CimEnabled to 0.

With a server reboot:

  1. Disable CIM in the VI Client or VirtualCenter: Configuration -> Advanced Settings -> Misc -> set Misc.CimEnabled to 0.

  2. Reboot the server.
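For anyone with several hosts to do, the no-reboot route can be run in one console session. This is a sketch only - it assumes Tech Support Mode is enabled and that esxcfg-advcfg and the /Misc/CimEnabled option are present on your build, so check before relying on it:

```shell
# 1. Stop the CIM broker and its watchdog immediately.
/etc/init.d/sfcbd-watchdog stop

# 2. Keep CIM off across reboots (same effect as setting
#    Misc.CimEnabled to 0 in the VI Client).
esxcfg-advcfg -s 0 /Misc/CimEnabled

# 3. Read the option back to confirm it took.
esxcfg-advcfg -g /Misc/CimEnabled
```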

As Don already pointed out, CIM comes with standard ESX 3i and has nothing to do with the HP agents. My servers are running plain VMware ESX 3i and I have run into this issue.

Also, disabling CIM has no effect on syslog. I use a remote syslog server here and it continues logging messages after CIM is disabled.

tschmidt
Contributor

I am on U3 and have had the reliability issues. Since disabling CIM, I have had one ESXi host drop off the network (the management interface).

This did not affect the running VMs. I ended up having to reboot the host to get it back on the network. The symptoms I usually see are hosts going disconnected, and anything done in vCenter to a host or its VMs taking forever or eventually timing out.

Mikeluff
Contributor

A BIOS update also seems to resolve this issue. On my production cluster I have both updated the BIOS and disabled CIM, and it's been stable ever since... On my test cluster, which uses the same hardware and was having the same issue, I have only upgraded the BIOS, and that has also been stable. One thing I noticed is that with the new BIOS installed, the External Temp Sensor that used to appear on the Health screen is no longer there - so the BIOS must correct some issues. Maybe with U4 that sensor just didn't work correctly...

Still waiting for something official from VMware though - a patch, etc.

fgw
Contributor

As already written in my previous post:

From the response I got from VMware support so far, the problem is related to CIM. Disabling CIM should be the workaround until VMware comes up with a fix!

Although a BIOS update shouldn't hurt, I don't think it helped fix this. Disabling CIM is what helped your servers...

I now have 5 ESX 3i servers running with CIM disabled and have had no problems for several days.

Mikeluff
Contributor

It must have helped, as I was getting the same issue on my test cluster, where I have not disabled CIM - only updated the BIOS - and I have had no issues since.

I agree this is a CIM issue, but what I am saying is that in my case the BIOS update included some changes HP made affecting CIM, so this could be part of the issue.

Mikeluff
Contributor

I spoke too soon - both test hosts failed at different times over the weekend. Does anyone have any updates from HP on this issue? VMware told me to contact them.

jparnell
Hot Shot

I have a support call open with HP at the moment. They have had to open one with VMware as they can't get to the bottom of it themselves.

Are you saying that VMware are washing their hands of it?

Mikeluff
Contributor

No - they say they are working with HP on the issue, but told me to also log a call with HP (on the phone at the moment). They can't say when it might be resolved, though.

donnieq
Enthusiast

Mikeluff: When the hosts "failed", did they become disconnected from vCenter while the guests continued to run normally, or were the guests unresponsive? We had one instance since disabling CIM where a host became disconnected from vCenter due to the management agents stopping. It was a quick fix to restart them; however, I still raised another case with VMware. I suspect the management agents stopping is a less frequent issue, at least in our environment, not related to the CIM issue, which has certainly cleared up.
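For anyone else stuck with a host in that disconnected state, restarting the agents from the unsupported Tech Support Mode console avoids a full reboot. A sketch, assuming the stock services.sh script is present at that path on your build:

```shell
# Restart the ESXi management agents (hostd, vpxa and friends).
# Running VMs are left alone; the host should reconnect to
# vCenter shortly afterwards.
/sbin/services.sh restart
```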

Mikeluff
Contributor

The hosts became disconnected, and the guests failed... restarting the services fixed this, but it caused the guests to reboot and come online on another host while the services restarted.

fgw
Contributor

Mikeluff, did I get this correct: the host failed, was upgraded to a new BIOS version, but had NOT stopped CIM?

If so, this makes me believe even more that VMware support, currently analyzing this issue, is on the right track, as they told me it's related to CIM eating up memory!

Also, the reason the guests came up on a different host after the failure might be that HA is enabled! Have you enabled HA in your cluster?

To me, the only workaround until VMware comes up with a fix is to stop CIM!

I have had CIM disabled on all my hosts for about a week now and have not observed this since!

I would recommend disabling CIM, as explained by various people and in one of my last posts, until we have a real solution for this!

Mikeluff
Contributor

Yes, HA is enabled, so I would expect the VMs to come up on another host.

The main issue is the fact that because the host doesn't either die or fail cleanly, the guests go into limbo as well, and so won't migrate or come online on another host until either the services are restarted or the host is rebooted.

It must be a CIM issue. I have disabled CIM on my production cluster and it's fine, but I wanted to prove to VMware that the BIOS update wasn't the fix for the issue, so I didn't disable it on my test cluster.

Mikeluff
Contributor

Has anyone heard anything back from either VMware or HP on this issue? VMware have closed my ticket, saying they are working on it and to just wait for an SR to be released. HP are still looking, but say they can't find anything on ESXi 3.5 U4 issues - I thought some of you had also logged calls?
