VMware Cloud Community
jftwp
Enthusiast

ESX 3.0.2 host spontaneously & completely unresponsive, yet physically running/hardware fine?

Here's a new and very troubling issue. At approximately 6:18 last night, one of our ESX servers in a 6-node cluster (all HP ProLiant DL360 G5 systems running ESX 3.0.2) suddenly went into a "Host is not responding" state in VirtualCenter. From that moment forward, the following applied:

  • No VI client connectivity to that host, either via VirtualCenter or directly via the VI3 client.

  • No HTTP response.

  • No SSH response.

  • No response at the physical console (the specialized console OS/VM that provides host interaction).

  • Could ping the host / service console address, however. Only sign of 'life'. Sigh.

iLO (v2) diagnostics on that host show no signs of foul play whatsoever on the hardware front: fans all running, local disks fine, temps reasonably low, no sign of ASR/reset in the logs, etc.

A complete mystery! All other hosts in this same cluster have literally identical hardware, firmware levels, and ESX 3.0.2, and have no problems.

There were 4 VMs running at the time the host decided to virtually drop off the planet. If there's any good news, it's that they have all remained running and quite functional as I type this, nearly 24 hours after the host 'died' in this sense. However, all of the VMs show up as 'disconnected' in VC (not surprising, given the 'not responding' state of their host), so there's no way to manage them at all: no VMotion, no shutting them down via VC, etc.

So, tonight, after advising users of 'maintenance', I must go into each guest and shut it down gracefully, bring the host down hard (I have to, since it's completely unresponsive in the software sense), bring it back up, confirm its guests are up, migrate the guests to other hosts in the cluster, put the host into maintenance mode, and remove the host from the cluster, because I don't want any part of that host until the root cause of this quasi-disaster is determined. All of this just so I can extract the vm-support output for submission to VMware per the ticket I opened with them this morning. The engineer came in via WebEx, took a look around, confirmed everything I noted above, and agreed that this is all that can be done right now. They must, of course, have the support output to (hopefully) find the root cause. This type of outage/crash of an ESX host is of course the classic 'unacceptable' (from both management and my own mouth) because it negates all of the HA/DRS/VMotion benefits and, well, it's unacceptable/unanticipated. Ugh. Not good.
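
For anyone following along, the extraction step itself should just be the standard support script once I have a working shell again; the copy-off destination below is only an example, not our actual box:

    # on the ESX 3.x service console, generate the diagnostic bundle for the SR
    vm-support
    # the script writes a compressed tarball (esx-*.tgz; exact name varies by build)
    # into the current directory; copy it off the host for upload (destination is just an example)
    scp esx-*.tgz admin@mgmt-station:/tmp/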

Hopefully this issue will turn out to be a hardware problem, in which case I won't be able to blame VMware; but as I confirmed with the engineer assigned to this case, hardware doesn't appear to be the cause at all. Still, after the host is back up and out of the cluster, I'm going to run fsck on the ESX host just to see if there are any possible problems with the disk/array/filesystem (per the VMware tech's suggestion).
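
In case it helps anyone else reading, my understanding of that fsck step is roughly the following; the device names are only examples, and the read-only pass comes first so nothing gets 'repaired' while a filesystem is still mounted:

    # see which ext3 partitions back the service console
    df -T
    # read-only check first (safe); /dev/sda5 is only an example device
    fsck -n /dev/sda5
    # if errors show up, reboot to single-user/rescue mode before running a repair pass:
    #   fsck /dev/sda5
    # note: this only covers the ext3 console partitions, not the VMFS volumes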

Regardless of the outcome, problems such as this are particularly bad because the VMs are 'stuck' in a 'disconnected' state and cannot be migrated via VMotion. I do hope we can get to the root cause of why a seemingly fine ESX host with a very stable history would suddenly and spontaneously lose all communications on the console, SSH, HTTP, and VirtualCenter/VI client ports. It basically 'died' in that sense, without any evidence of ever going 'down'. Very disconcerting/frustrating.

Has anyone ever heard of an ESX host just 'dying' like this with these symptoms? Any suggestions/thoughts? Hopefully, the ESX logs will confirm what happened when VirtualCenter initially reported the host system as being down.

JDLangdon
Expert

I have been experiencing the same issue with a 3.0.1 server, and I have let my server run for days and sometimes weeks in this "dead" state without any effect on the VMs.

I opened an SR with VMware who, after looking at my log files, said that they did not see anything out of the ordinary other than that I had Nagios on the COS as a means to monitor the system. I had Nagios installed on 7 ESX host servers, but only one experienced this issue.

I removed Nagios and have not experienced this issue since; however, my SR remains open for a few more weeks just in case the issue recurs.

Jason

dfgl
Hot Shot

We have had the same problems in the past caused by three different issues:

1. Do you have the HP agents installed? We found that restarting them overnight using cron stopped the issue from happening (a rough cron sketch is at the end of this post).

2. The firmware on our hosts' HBAs was out of date. The firmware being used was the same on all eight hosts but, frustratingly, only ever caused a problem on one host; flashing the HBAs sorted the problem.

3. One guest had redo files attached to each of its vmdk files; these had grown to over 1GB and resulted in this issue. Committing the redo logs solved it.

Hope any of the above helps.
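
For what it's worth, the cron workaround for 1) and the check we used for 3) were roughly as follows; the hpasm init script name, the made-up cron file name, and the 1GB threshold are from memory, so treat this as a sketch rather than gospel:

    # 1) restart the HP agents overnight via cron (init script name may differ per Support Pack version)
    #    contents of a hypothetical /etc/cron.d/hpagents file:
    30 2 * * * root /etc/init.d/hpasm restart >> /var/log/hpagents-restart.log 2>&1

    # 2) spot guests whose redo/delta files have grown past ~1GB (1048576 x 1k blocks)
    find /vmfs/volumes \( -name '*-delta.vmdk' -o -name '*.REDO' \) -size +1048576k -exec ls -lh {} \;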

Texiwill
Leadership

Hello,

It sounds like the VM running the Service Console has crashed, which implies that something running in the SC is the culprit. I would check your HPASM agents and anything else that you have running there. I would also connect a physical console and let it run so you can capture the crash Oops when it happens.
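
If a physical console isn't practical, checking the standard console logs after the forced reboot, and mirroring the console to a serial port, are the next best things; the settings below are just the generic Linux approach, nothing ESX-specific:

    # after the reboot, look for any last messages from the console OS before it died
    less /var/log/messages
    less /var/log/vmkernel
    # to catch an Oops that never reaches disk, append something like this to the
    # kernel line in the boot loader and attach a serial cable/concentrator:
    #   console=ttyS0,9600n8 console=tty0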

Best regards,

Edward L. Haletky, author of the forthcoming 'VMWare ESX Server in the Enterprise: Planning and Securing Virtualization Servers', publishing January 2008, (c) 2008 Pearson Education

--
Edward L. Haletky
vExpert XIV: 2009-2023,
VMTN Community Moderator
vSphere Upgrade Saga: https://www.astroarch.com/blogs
GitHub Repo: https://github.com/Texiwill
jftwp
Enthusiast

Thanks for all the responses. So, between Nagios (hey JD, is Nagios considered unsupported by VMware?) and HP SIM agents (which have always stirred up some debate going back to the ESX 2.x days), it would seem that the console guest OS/VM isn't entirely happy with them---not around the clock, anyway.

dfgl, I'm not sure we want to get into the business of cron-based workarounds for possibly suspect SIM agents. The firmware on our QLogic HBAs is no doubt dated to some extent (we have 1.06; I haven't even checked for the latest yet, but will shortly), but having QLogic firmware cause the service console VM to crash seems odd. Then again, we're all used to 'oddities' (and Murphy!) in this business, so I can't and won't rule it out! As for redo files within a guest, it also seems odd that they would kill the SC, but I won't rule that out either; I'll check.
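
For anyone else who wants to check the HBA firmware from the console without a reboot, the QLogic driver's proc nodes show it; the path below assumes the qla2300 driver and instance 0, so adjust as needed:

    # list the HBA instances the QLogic driver has registered
    ls /proc/scsi/qla2300/
    # dump instance 0 and pull out the firmware line to compare against the latest release
    grep -i firmware /proc/scsi/qla2300/0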

Texiwill, I'll relay your thoughts to the engineer on my case as well. Thank you.

Will keep all posted.

JDLangdon
Expert

Thanks for all the responses. So, between Nagios (hey JD, is Nagios considered unsupported by VMware?)

You are correct: Nagios is considered unsupported by VMware when installed within the service console because there is a possibility that it could cause a memory leak. That memory leak is what "kills" the COS.
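
If anyone wants to keep Nagios around a little longer while watching for that kind of leak, a throwaway cron entry that logs console memory is usually enough to show the trend; the file name and interval below are just examples:

    # contents of a hypothetical /etc/cron.d/cosmem file: sample console memory every 15 minutes
    */15 * * * * root (date; free -m; ps axo pid,rss,comm --sort=-rss | head -5) >> /var/log/cosmem.log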

Jason

jftwp
Enthusiast

I'm not averse to removing the HP SIM agents entirely from our ESX hosts in all clusters and instead relying on VirtualCenter for host alarm conditions, just to rule out 3rd-party management software that causes who-knows-what to happen. Still, I'm not at that point just yet... waiting to see if support can pinpoint the cause of the console OS VM crashing.
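
If we do end up going that route, pulling the agents should just be a matter of finding and removing the HP packages from the console; package names differ between Support Pack versions, so the ones below are only examples:

    # list whatever HP/Compaq packages the ProLiant Support Pack left on the console
    rpm -qa | grep -iE '^(hp|cma)'
    # then remove the agent packages by name, e.g.:
    #   rpm -e hpasm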

JDLangdon
Expert

Are the HP SIM agents listed on the approved/supported COS software list? I know Nagios isn't, but that doesn't mean the HP SIM agents aren't supported.

jftwp
Enthusiast

Not sure. Is there a definitive list? The closest I could find was VMware's whitepaper on 3rd-party software in the SC, in which VMware essentially says, "You can, but overall you really shouldn't."

MschulzInWiscon
Enthusiast

I've run into this same thing with the same symptoms, on different hardware. I have an HP c-Class blade enclosure with 6 blades in it: 4 are identical, and the other 2 are identical. This only happens on one of my two identical ones (BL460c G1, dual quad-core 2.66GHz 5355s, 16 GB RAM, QLogic HBAs). It has never happened on my 4 DL380s or the other blades; it just occurs on this one blade.

The only 3rd-party software loaded is the Navisphere Agent for our EMC SAN. No Nagios anywhere on the network (though I was thinking of looking into it). I've been lucky in that this only happens in a much less business-critical cluster with only XP VMs in it.

This blade was installed fresh with ESX 3.0.1. I've since upgraded it to 3.0.2 and, just today, Update 2 with patches 1002424, 1002425, and 1002429, each time hoping the latest upgrade would fix it. I would also like to avoid cron jobs to reboot the hosts; in my mind, we shouldn't have to do that. I haven't looked at the HBA firmware version, but the blade this happens on is only about 3 months old (and the problem started only about a month or so after we got it).

Has there been any update from tech support? Usually I can just cold-boot the blade and bring it back online without an issue, so I've been avoiding calling tech support.

jftwp
Enthusiast

For what it's worth, I've since uninstalled just the HP Storage Agents component of the SIM agents, and the problem has not recurred yet. (Sssshhh, my ESX server will hear you.)

dpomeroy
Champion

I have a flaky ESX server that has gone months with no problems and then just crashes. Watch out for vendors saying it's not a hardware problem just because none of the red lights are on and their diagnostics don't find anything. I have seen numerous hardware problems that were not detected by the monitoring agents, vendor diagnostics, little warning lights, etc.

Don Pomeroy

VMTN Communities User Moderator
