Mikeluff
Contributor

ESXi 3.5 Host Hangs Since U4?

Hi, I have 9 ESX hosts running in three different clusters. Two clusters are HP DL385 G2s and one cluster is HP DL385 G5 hardware. A week or so ago I noticed that my test cluster and my G5 hardware cluster had issues where a host would become unreachable from vCenter and the VMs running on it would drop out. The host would stay pingable, and none of the VMs would come online on another host in the cluster. Connecting to the host console was slow; you could log on but not reboot using F11, it would just sit there.

I thought the issue might be due to the USB sticks provided by HP, as there is an issue with them, so HP replaced them all for me. All test boxes and G5 hosts were running U4, but had only been running for a short period of time. As I was replacing the USB sticks in all hosts apart from one in the production cluster (already replaced a few weeks ago), I thought I would apply Update 4 and the latest patches from Update Manager. Now all hosts running U4 have at some point failed in the same way as above, and I can't figure out what's going on. I have logged a call with VMware but only have Gold support (Platinum next week, I think), so when two hosts failed over the weekend it didn't look good. The latest ESXi U4 patches don't seem to have fixed the issue either... doh!

Has anyone seen this before, or is anyone having the same issue, while I wait on support? It happens on different hardware, so it must be a bug that came with Update 4. The one host that is still running U3 hasn't even twitched, so if all else fails I will have to roll the others back to that version.

Cheers

100 Replies
jparnell
Hot Shot

Yes, I logged a call with HP as we purchased our VMware support through them. They worked with VMware, asking me to send lots of log files etc. They explained this is a known bug and VMware will be releasing an update soon. The PR numbers are 360229 and 403926. In the meantime, they have also suggested stopping sfcbd until this is fixed (/etc/init.d/sfcbd-watchdog stop).

James

fgw
Contributor

Yes, logged a call with VMware.

Last time I checked with them was on 25 May. The response I got was as follows:

"Unfortunately there is no fix identified for this issue yet.
The latest entries in the bug show it is still under investigation by engineering."

Anyway, disabling CIM seems to be a valid workaround, as I have not seen the problem since I disabled CIM!

Although there is a new patch available for download (ESXe350-200905401-O-BG.zip, dated 28/5/2009), I could not find any mention of our CIM issue in it. I doubt it's fixed with this patch.

So, we still have to wait.

Mikeluff
Contributor

Do you have the internal HP number?

jparnell
Hot Shot

My HP call ref no. is 1604831078

TonyCoffman
Contributor

We still have an open ticket for a very similar issue and VMware recommended the same fix (disable CIM).

Last discussion I had with them was on Monday and they told me pretty much the same thing - Engineering is investigating and they don't know when it will be resolved.

In the meantime, our hosts have been stable.

ESXi build 158869 running on Dell hardware (R900).

Regards,

--Tony

Mikeluff
Contributor

Interesting, so it's not only HP hardware that's impacted then. They must be able to look through the changes implemented in U4 to identify the cause.

rkobiske
Enthusiast

Mikeluff: I am having the same issue with my HP hardware and ESX 3.5 Update 4, although we are running the installable version. Every once in a while our new DL380 G6 host becomes disconnected and takes all of its VMs with it. I have to power off the host to get the VMs to come up on a different host. I opened a case with HP about this; it has now made its way over to VMware and they are looking into it. They found that the service console was running out of memory, and I was getting memory errors in my hostd.log file. All three HP servers in the cluster were getting these memory errors (two DL380 G5s and one DL380 G6), but only the G6 would become disconnected over time.

HP had me disable the CIM agents and the pegasus service, as there is a known memory leak to be fixed in Update 5 (no ETA). Even after I disabled the CIM and pegasus services I continued to have memory issues, and HP/VMware are still looking into it. I seem to get the memory errors after VMotions have occurred to the box. The hostd process seems to consume memory and never release it. To work around my issue, I've been restarting the mgmt-service on my boxes when I start to get the memory errors.
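
For anyone who wants to see how often these memory errors show up, here is a minimal Python sketch that lists the matching lines in a copy of hostd.log. The exact wording of the warnings is not quoted anywhere in this thread, so the `MEMORY_PATTERN` regex is a hypothetical placeholder, and the function names are made up for illustration. Adjust the pattern to whatever your log actually contains.

```python
import re

# Assumption: the thread does not quote the actual hostd.log messages, so this
# case-insensitive match on the word "memory" is only a placeholder pattern.
MEMORY_PATTERN = re.compile(r"memory", re.IGNORECASE)


def find_memory_lines(lines, pattern=MEMORY_PATTERN):
    """Return (line_number, text) pairs for every line matching the pattern."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]


def scan_log(path, pattern=MEMORY_PATTERN):
    """Scan a local copy of hostd.log; the path is wherever you saved it."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_memory_lines(handle, pattern)
```

Calling `scan_log("hostd.log")` on a log copied off the host returns the matching lines with their line numbers, which makes it easier to line the warnings up against the times VMotions took place.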

macka
Contributor

I can confirm that this error also occurs on a brand spanking new Dell 2950 with ESXi 4. It does not occur with 3.5 U1, but I have not tested later releases of 3.5.

Regards,

Elliot.

jmcdonald1
VMware Employee

Hi Macka, et al,

Have you opened a Service Request? If so can you provide it to me? If not, can you please open one and let me know the number. I have been working closely with VMware development on a number of cases similar to this and we would like to diagnose as many cases as possible.

Cheers,

/Jonathan

Mike625
Contributor

We have been having the same issue on IBM x3850 M2 hardware since early to mid-May. Support advised us to disable the CIM OEM providers (they already were) and to stop sfcbd-watchdog, wsmand and slpd. We've done this, but not enough time has passed yet to say whether it has helped. Of course, we have no hardware monitoring until this is fixed. Our case is SR# 1322911471.

On the F2 console, the login dialog is displayed. Upon entering credentials, it disappears and no login occurs. Attempting to log in again does not bring the login dialog back. We can still get to the F11, F12 and F1 consoles. During this time all guests are unreachable and HA does not kick in - guests just stay down until the host is rebooted. Logging into F1 and trying to stop the above three processes results in a hung F1 console upon stopping slpd, and guests remain unreachable. We can still switch to the alternate consoles. Starting to wonder if ESX is a better choice...
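
As a rough sketch of the workaround, the Python helper below only builds and prints the stop commands for the three services named above. It does not run anything. The `/etc/init.d/<name> stop` form is confirmed in this thread only for sfcbd-watchdog; that wsmand and slpd have init scripts in the same directory is an assumption. The names `SERVICES` and `stop_commands` are made up for illustration.

```python
# Service names are taken from this thread. The init script location is
# confirmed here only for sfcbd-watchdog; for wsmand and slpd it is an assumption.
SERVICES = ("sfcbd-watchdog", "wsmand", "slpd")


def stop_commands(services=SERVICES, init_dir="/etc/init.d"):
    """Build one '<init_dir>/<service> stop' command line per service."""
    return [f"{init_dir}/{name} stop" for name in services]


# Print the commands for review instead of executing them.
for command in stop_commands():
    print(command)
```

Printing the commands first keeps this a dry run. On a host that is already in the failed state, stopping slpd hung the console as described above.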

rkobiske
Enthusiast

Ditto on the login dialog hangs. On mine, I enter the username and hit Enter, and the password prompt never comes up either. The box has completely hung. I'm able to ping the box, but not able to access any of the VMs on it or the box itself. SR1277931241.

VMware/HP claim this is due to the out-of-memory warning messages I continually get in my hostd.log file after a certain amount of time and a certain number of VMotions. They can't figure out why I'm getting the memory warning messages either.

This has caused a lot of issues because HA never kicks in when the boxes hang.

Mikeluff
Contributor

All,

I am currently working with VMware Escalation support on this issue. They have been monitoring my test environment since Tuesday, and we are waiting for it to fail, at which point they will call engineering, who will connect to my system. I can't give you any more information than that, other than the fact that they are working on it and now have access to a system that "hopefully" will fail and give them the information they require to diagnose and resolve the issue.

I will update you with anything I can along the way.

jmcdonald1
VMware Employee

Thanks for the information everyone, it has proved very helpful. I am one of the Escalation resources that has been working on these issues.

I have a question for everyone, which may or may not be related but is important. What brand and model of HBAs are in the servers that have been affected?

Cheers,

/Jonathan

stefanjansson
Contributor

I'm on vacation and will be back on June 29.

Regards,

// Stefan

Mikeluff
Contributor

Emulex LPe11000, although your Engineering team have all my info now.

fgw
Contributor

Using HP Emulex LPe1105 Fibre Channel HBAs (HP product number 403621-B21) in HP BL460c blade servers!

jparnell
Hot Shot

We're using HP FC2142 / Emulex LPe1150.

Mike625
Contributor

We're using IBM-branded Emulex LPe11002.

jmcdonald1
VMware Employee

Awesome, thanks for the responses everyone. I just wanted to quickly update this thread to say that we believe we have found the problem and are currently testing a fix for the issue. I will post again once we have something more official on this.

Cheers,

/Jonathan

Mike625
Contributor

Jonathan, thank you very much for the update. We are getting a little frustrated over the lack of official communication (no kb doc, support case in limbo, etc) on this issue and what kind of a timeline we are looking at. Do you have anything more you can share? Is this bug documented in a bug tracker where we can monitor progress?

Thanks,

Mike
