Eudmin
Contributor

ESX 3.5 U1 host repeatedly not responding, suggestions for diagnosis techniques?

This is getting more frequent. It happened two weeks ago, then 1 week ago, and now 3 times in the last 3 days.

I get a message in Virtual Center that just says "Host <IP> is not responding." When I go to the console of that server I can see the service console and can type my login name, but I am never prompted for the password. Sometimes after getting that message in VC I can log in to one of the guests (a Linux one) via SSH and can even run "top" in that guest, but directory listings fail. When I reboot the ESX server it comes back up, and when I look in /var/log/messages I don't see any messages at all that would indicate what is failing. I can tell exactly when the failure occurred because /var/log/cron shows the hourly jobs, and they stop at the time of the hang. I can also tell because I've configured Virtual Center to email me when it gets an alert that the host isn't responding.
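
As a rough sketch of how the hang time can be read out of the cron log: the helper below looks for the largest gap between consecutive timestamps. It assumes standard syslog-style "Mon DD HH:MM:SS" stamps; the sample lines, the host name esx01, the year, and the function names are all made up for illustration.

```python
from datetime import datetime

def parse_syslog_time(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a syslog-style line.

    Syslog stamps carry no year, so the caller has to supply one.
    """
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

def largest_gap(lines, year):
    """Return (last_entry_before_gap, first_entry_after_gap), or None."""
    times = [parse_syslog_time(l, year) for l in lines if l.strip()]
    if len(times) < 2:
        return None
    # Compare each entry with its successor and keep the widest interval.
    return max(zip(times, times[1:]), key=lambda pair: pair[1] - pair[0])

# Hypothetical excerpt in the style of /var/log/cron.
SAMPLE = """\
Aug 12 01:01:01 esx01 crond[2101]: (root) CMD (run-parts /etc/cron.hourly)
Aug 12 02:01:01 esx01 crond[2177]: (root) CMD (run-parts /etc/cron.hourly)
Aug 12 09:14:37 esx01 crond[1650]: (CRON) STARTUP (fork ok)
Aug 12 10:01:01 esx01 crond[1902]: (root) CMD (run-parts /etc/cron.hourly)
"""

if __name__ == "__main__":
    before, after = largest_gap(SAMPLE.splitlines(), year=2008)
    print("last entry before the gap:", before)
    print("first entry after the gap:", after)
```

The hang would then fall somewhere in the hour after the last entry before the gap, which narrows down where to look in the other logs.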

The server is an HP BL460c, and I'm not seeing any health warnings in HP SIM, but I'm not an extremely experienced user of that tool, so perhaps I'm missing things. I installed the HP Agents for ESX 3.x that you can download from their site. They installed without incident and I can see the System Management Homepage on the server, which asserts "no failed/degraded" items. No logs so far are giving me any insight into what is happening.

Can someone give me a suggestion for where to look to start diagnosing this problem? I'd like to do some self-checking before getting on the phone with HP support. I'm considering transferring the VMs off to my other server (I only have 2, both BL460c's and the other one's working fine as an ESX 3.5i installable). After transferring them I'd run memtest for a good long time to see if I can get the thing to stop responding.

36 Replies
lamw
Community Manager

This is the equivalent of the Unix/Linux "/etc/hosts" file: each line has the IP address, then the FQDN, then the short name. On Windows you'll find the hosts file under "%SystemRoot%\system32\drivers\etc" (C:\WINDOWS on a default XP or Win2k3 install, C:\WINNT on Windows 2000). Again, this fix was only done because of our primary sub-domain with the top-level domain, which we don't really use, but I guess it's pushed down from AD when we joined the server. This oddity created some instability in the lookups, so adding this value to the hosts file on the VC Server actually fixed some issues and stabilized the environment, but I think another part of it was bringing our ESX to a specific patch level. Good luck, and be careful about your changes until you're sure =]
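
For illustration, here is a minimal sketch of that "IP address, FQDN, short name" layout, with a small parser to check that both names resolve to the same address. The addresses (from the 192.0.2.0/24 documentation range), the example.com names, and the parse_hosts helper are hypothetical, not taken from anyone's real setup.

```python
def parse_hosts(text):
    """Return {name: ip} for every name and alias in hosts-file text."""
    names = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        ip, *aliases = line.split()
        for alias in aliases:
            # The first entry wins, mirroring how resolvers read the file.
            names.setdefault(alias.lower(), ip)
    return names

# Hypothetical hosts file content for a VC server.
SAMPLE = """\
# IP address    FQDN                 short name
127.0.0.1       localhost
192.0.2.10      esx01.example.com    esx01
192.0.2.11      esx02.example.com    esx02
"""

if __name__ == "__main__":
    table = parse_hosts(SAMPLE)
    for host in ("esx01", "esx02"):
        fqdn = f"{host}.example.com"
        print(host, table.get(host), "matches FQDN:", table.get(host) == table.get(fqdn))
```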

Eudmin
Contributor

Rparker,

After I restarted the server again I did what you suggested. Mostly. There was no user on the system called vim manager, but I did delete vcsa and vpxuser. There was a vimuser. Was that the one you were thinking of?

Anyway, then I got rid of the stuff in the ssl directory and restarted the mgmt-vmware service. It looks like vpxuser got re-added, but not vcsa, which had UID 69 previously. Hope this one isn't important. 😎

lamw,

I added the entries to that host file and restarted VC. I'll watch it to see if it works.

In the meantime, I still have the question of where I should look for logs to help diagnose what's happening. I appreciate all of the suggestions, but they are just kind of trial and error solutions to the problem of my ESX server just hanging and not responding on any interface (service console, VIC, or SSH).

Eudmin
Contributor

And it just went down again. This is crazy. I can't even keep it responding long enough to transfer the VMs off of it.

lamw
Community Manager

Is this occurring on only one of your ESX Servers, or is this your only one? I would double-check with your network team to make sure nothing is going on ... it almost sounds like the symptoms of an IP address already being in use by a Windows system. The next time you can't get to your system, see if you can just ping it and see if it responds; I have a funny feeling it might be some rogue Windows box stealing your IP or vice versa. I would probably try to connect through a DRAC/iLO console if it's located nearby; I know that's not always the case and the server(s) might be at a remote site. Though for troubleshooting with so much disconnect, I would just console in directly if possible.
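
Alongside a plain ping, a quick TCP probe can show which services still answer. The sketch below assumes the usual ports for an ESX 3.x host (22 for SSH, 443 for the VI client, 902 for the VMware authentication daemon); the address in the example and the helper names are hypothetical. Note that an open port only shows that the TCP stack answers; a hung host may still accept connections.

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:          # refused, unreachable, or timed out
        return False

def probe(host, ports=(22, 443, 902)):
    """Map each port to whether it accepted a connection."""
    return {port: port_open(host, port) for port in ports}

if __name__ == "__main__":
    # 192.0.2.10 is a placeholder; substitute the service console address.
    for port, is_open in probe("192.0.2.10").items():
        print(f"port {port}: {'open' if is_open else 'no answer'}")
```

If the ports answer while a second machine shows a different MAC address for the same IP, that would point towards an address conflict rather than a hang.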

Eudmin
Contributor

I have two total. It's just occurring on one. Actually, the other one is ESXi, not ESX, so maybe you could say it was happening with my only one.

I agree, it could be one of those IP address in use things aside from the fact that when I connect to the console using iLO I can see the service console, I type my username, and then am not asked for the password. It's completely hung.

Actually, is it possible that this is what happens if you fill up all of the disk space for VMFS? I was watching the space via Virtual Center and it alleged that I still had 26GB left. I'm using the on-blade storage with 60GB initially. If it reports the free space wrong and doesn't keep track of, for instance, space taken up by snapshots then would the disk just fill up with no indication in the log and hang the server because all writes would fail?

lamw
Community Manager

Well, next time you can actually connect, check "vdf -h" to verify that you're not filling up your root partition or VMFS volumes; that would cause either your ESX host or your VMs to crash. At this point it's very hard to troubleshoot your issue; there are so many unknowns with your single ESX Server. Possibility of filling up the storage? IP conflicts? .... I would recommend opening an SR with VMware and seeing if support can guide you via a WebEx session. The hanging prompt leads me to think it's something with the network/DNS, but it's also not responding in VC, though that could just mean your management agent is hosed. Are the VMs still functional? Are these on the SAN or local storage? If they're on the SAN, I would say reboot this guy, put it in VMware maintenance mode the first chance you get, and troubleshoot from there, possibly even going into recovery mode to see if the networking is okay.
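
As a sketch of what to look for in that output, the helper below flags volumes at or above a usage threshold. It assumes "vdf -h" prints df-style columns with a "Use%" field followed by the mount point; the sample output, sizes, and volume names are invented for illustration.

```python
def full_volumes(df_output, threshold=90):
    """Return [(mount_point, percent_used)] for volumes at or above threshold."""
    flagged = []
    for line in df_output.splitlines()[1:]:          # skip the header row
        fields = line.split()
        for i, field in enumerate(fields[:-1]):
            # The usage column looks like '97%'; the mount point follows it.
            if field.endswith("%") and field[:-1].isdigit():
                percent = int(field[:-1])
                if percent >= threshold:
                    flagged.append((fields[i + 1], percent))
                break
    return flagged

# Hypothetical output; long device names wrap onto their own line as in df.
SAMPLE = """\
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2             4.9G  1.5G  3.2G  32% /
/dev/sda1              99M   27M   67M  29% /boot
/dev/sda6             2.0G  1.9G   60M  97% /var/log
/vmfs/volumes/0123abcd-example
                       60G   34G   26G  57% /vmfs/volumes/storage1
"""

if __name__ == "__main__":
    for mount, percent in full_volumes(SAMPLE):
        print(f"{mount} is {percent}% full")
```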

Eudmin
Contributor

lamw,

Sorry it's been so many different possibilities. I don't think any of those are the cause, however. The disk isn't full, there's no IP address conflict. Name resolution seems to be working fine. We aren't using a SAN. Our arrangement isn't really that complicated. I've got 4 VM's running on an HP blade which has 4GB of RAM and two 73GB hard drives set up in a RAID1 mirror. I have them all stored on the local disks, and they are very low utilization Linux servers with 8GB virtual disks and 512MB of memory each.

When the server begins to hang here's what happens:

- Logins to the console of the ESX server through integrated Lights Out don't work. I see the console, but can't log in. This should have no relation to any problems with the management agent or the network settings.

- The Virtual Center server eventually notices in a few minutes and says it's not responding.

- The virtual machines running on it will respond to pings, so they're kind of working, but they do little else.

- Existing SSH logins to the ESX host aren't booted off, but I can't execute anything or read any files.

- Existing SSH logins to the virtual machines aren't booted off, but it's the same story as with the ESX host. I can't execute anything or read any files.

- Looking at the thermal data using iLO for the blade seems to indicate that everything is fine

- Upon rebooting the server I can check the array status of the RAID disks, and it says that they are fine.

- Upon rebooting the server I can look at log files after SSH-ing to the ESX server, but there are no log entries from during the time it is hung, that I have noticed in the usual places on Linux (/var/log/messages, dmesg, etc)

- The hangs are getting more frequent. It happened once, then two weeks later, then a week later, then a few days later, and then 4 times yesterday.

I did transfer the VMs that I wanted to be running to my other blade, so there were no VMs running on the malfunctioning host. It was doing nothing with VMs all night, but it still hung at 1am according to the email I got from VC.

I'm leaning towards a hardware problem with the blade. Maybe some kind of memory or CPU error that's getting worse. If it was a desktop I'd suspect the power supply too, but since these things are in an enclosure I'd expect power problems to affect both of the blades equally, but the other blade has been solid for 3 months.

I'm grateful for the many suggestions in the thread, but are there no useful logs that can be checked when things are going south? I've been a redhat user and admin for 12 years, so I'm pretty familiar with the normal places to look for problems in Linux, but those aren't giving me any clues on this ESX server. The log files are just blank after the time that the problem occurs.

lamw
Community Manager

Yeah, I would say this might be a hardware-related problem. I guess it's not bad enough yet to give you a PSOD. Can you install the HP SIM agent to monitor the hardware? I think the best route is to get hold of HP's SmartStart CD and run diagnostics across all your hardware, and possibly run memtest86 for 72+ hours; either the SmartStart CD or the memory testing may give you a precise error. If not, then I would suggest contacting HP hardware support and having someone come out to help troubleshoot the issue. I know they may not always be helpful, but it might work. I don't think it's on the ESX side of things; usually when ESX hosts poop, they'll produce a PSOD (purple screen of death) and link it to either CPU or memory. Also check /root or /tmp to see if you have any vmkcore dump files; these can also point to hardware faults. If you can, run "vm-support" and open a support case with VMware. They can parse through the logs in detail and might be able to tell you exactly what's going on; the bundle contains not only the general logs but also the processes that were running and other config files. I think this is your best bet, as everything points to hardware faults and not ESX.
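
A small sketch of the dump-file check, in case it saves some typing. The file name patterns ("vmkernel-zdump*" and "vmkcore*") are assumptions about how the dumps are named on a given build, so adjust them to whatever actually shows up; the find_dumps helper is hypothetical.

```python
import glob
import os

def find_dumps(dirs=("/root", "/tmp"),
               patterns=("vmkernel-zdump*", "vmkcore*")):   # assumed names
    """Return [(path, size_bytes)] for matching files, newest first."""
    hits = set()
    for directory in dirs:
        for pattern in patterns:
            for path in glob.glob(os.path.join(directory, pattern)):
                if os.path.isfile(path):
                    hits.add(path)
    ordered = sorted(hits, key=os.path.getmtime, reverse=True)
    return [(path, os.path.getsize(path)) for path in ordered]

if __name__ == "__main__":
    dumps = find_dumps()
    if not dumps:
        print("no dump files found in the directories checked")
    for path, size in dumps:
        print(f"{path} ({size} bytes)")
```

An empty result does not rule out a hardware fault; it only means the host never got far enough to write a dump.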

richard6121
Contributor

Are the service console NIC's teamed? If so, unteam them and just run the svc console with one NIC for a little while. Let me know if this has any impact.

Eudmin
Contributor

Nope, not teamed. I do have 2 NICs in the server but only one is connected. I just did the simplest config of ESX. The problem isn't just that it stops responding on the network. The problem is that it stops responding. Period. Even on the console, the actual console, the one where you have to walk up to the computer, not the "ESX service console" which is actually redhat Linux.

In response to lamw,

24 hours of memtest running and no errors were detected. I also booted Knoppix and have it doing some cheap calculations, but ones that will get the CPUs nice and hot and no errors so far. I do have the HP SIM agent installed, but the SIM agent stops responding along with everything else unfortunately. At least it doesn't give me any useful error messages. HP SIM just says it can't get a response from the agent. I wish the iLO web page were a little more powerful. 😎

mjiang
Contributor

Have you made any progress in finding the root cause?

I've been plagued by a similar issue since June. 4 x DL585 G2 had been running happily for more than a year, then out of the blue they started to play the freeze game in turn. Different date, different time, same symptom. I tried upgrading them all from 3.0.1 to 3.5; they were stable for a month and then it started again.

Because there is no trace in any log, VMware support has looked at it several times with no conclusion. They tend to blame either the hardware or the power, but again there is no trace in the HP hardware log indicating any issue. Burn-in was run as well with no result (as a matter of fact, all 4 are always loaded during the day, so you could say they are burning in all the time). The power is provided by a filtered data centre UPS and is hard to blame, because only one server freezes at a time.

I am totally lost at this stage with customer screaming. Any hint or update on the progress will be appreciated.

oschistad
Enthusiast

This is a long shot, but the symptoms you describe with the entire ESX host freezing sound a lot like what you may sometimes get with old firmware for the motherboard, RAID etc. Sometimes when drivers are updated your existing hardware requires new firmware, and there have been cases where for instance a RAID controller running on old firmware causes a total system freeze.

Since it's always good to be on the current firmware revision anyhow you might want to do a complete update of all flashable components in the server and see if that helps.

Eudmin
Contributor

No progress. We still have 1 blade out of 2 that hangs repeatedly after running (and doing no guest OS hosting) for a few days. The other blade has been running for 150 days with no reboots. I even installed the ESX 3.5i U2 installable image on it with no joy. Same symptoms. For 3.5i I can get to the management interface on the console where it lets you set IP addresses and configure the management network and stuff. I can even play around with the network settings, but it just flashes and shows me nothing when I try to view any log files.

Oschistad, who replied to your message, may be on to something, though. I don't know if it's firmware (because my working blade is running the same firmware), but I'm coming to the conclusion that it's something with the RAID controller. I really don't think it's the hard drives themselves failing, because when the machine is running HP SIM reports no problems with the hard drives. I assume that it's able to read the SMART data from the drives and reliably report drive health, so there must be something wrong with the RAID controller.

BTW, memtest ran for a week and got no errors on mine, so it's not a memory quirk either.

Your problem sounds a bit different from mine since all of yours are freezing up. I'm not familiar with that model server, but if they're blades, is it possible that your problem is with the enclosure? Maybe get on the phone with HP support and try to talk their local guys into lending you one or something?

mjiang
Contributor

The boxen are on their way to be flashed with the latest firmware for a CPU upgrade. I will try updating all flashable components as suggested. Fingers crossed.

But I doubt a driver could be the cause, since the servers ran quite happily on 3.0.1 for more than a year without any problem. It all began out of the blue. Upgrading to 3.5 was actually recommended by VMware support as a fix, which did not work.

mjiang
Contributor

One thing I don't understand is that we have quite a few DL585s with all different versions of firmware as shipped, some of them running ESX as well. It'd be extremely lucky (unlucky for me) for the fault to happen only on those 4 servers (which happen to belong to the same customer 8-(( ).

Also, I reckon if the RAID controller is playing up, there's got to be a trace somewhere in the hardware log.

I sincerely hope the upcoming firmware upgrade ends this nightmare.

NZ-Noobie
Contributor

Hi Guys,

Just wondering if anyone has made any headway into the cause of this error. I am having a similar issue with our ESXi server.

First, I guess, a little history...

We have 2 x HP blade BL-25p G2s running ESX 3.5; a Virtual Center server is set up as a VM on one of these ESX 3.5 hosts. With VMware suddenly giving away ESXi for free we thought we'd try it, so I downloaded the ISO and set up ESXi 3.5 Update 2, build number 103909. All was good and fine for about 2 - 3 weeks, then the issue with this build and August the 12th showed up, so I waited patiently for the new release of the ESXi installable ISO. Once it was released I downloaded it and rebuilt our ESXi server using the new ISO (ESXi 3.5.0 Update 2, build number 110271), and it lasted all of a day before I started getting not responding errors, only for the ESXi server, in our Virtual Center VM server.

If I reboot the ESXi server it will respond for around 5-10 minutes and then go back to not responding. As with the other guys in this thread, I am getting all their issues as well (i.e. it responds to pings, but I can't log in or SSH, or connect via the ESX client). I have also tried most of the stuff suggested here, but to no avail.

Can anyone help?

mjiang
Contributor

Guys,

VMware support now suggests disabling or uninstalling the HP SIM agent from the service console, or at least using version 7.91 for ESX 3.5.

1. Which version of the HP SIM agent are you using?

2. Could you try disabling or uninstalling the agent as well?

They say that cases of ESX hanging while the agent is in use have been reported to them, and that the fault is not specific to hardware/firmware.

I will definitely try disable/uninstall/upgrade the agent.

Let me know whether that works for you.
