VMware Cloud Community
Goatiiee
Contributor

FIXED::: ESXi Host Locks Up - All VMs down - Workaround is rebooting Host

My apologies if this has been covered before, but I have not seen this particular issue in previous discussions.

Our client has a new Dell T610 server running ESXi 4.1. We migrated 9 servers over (P2V) and have since experienced a number of lock-ups of the host server itself. The servers run Windows Server 2003, except for one Windows 2000 server and one that is just an XP machine. These lock-ups started to occur even when there were only two VMs on the host.

One thing we've noticed is that when the server is running OK, everything under Configuration > Health Status in vSphere is green, and there is a lot of device information about Controller 0. When the host server locks up, there is an Alert under the Storage group with a device named "unknown", and NO controller information is listed at all. Basically we have to log into the DRAC and do a warm reset, because shutting down the virtual machines stalls at 95%. Even pressing F12 at the console to restart does not complete during the lock-up.

While the server was locked up, I tried to export the system log files. The export ran for a few minutes, then reported that it had completed, and at the same moment my session disconnected from the host. The log files are not complete, as they are only about 1.5MB on average. When the host is working normally, the log files are closer to 19MB.

Initially, when it first happened, we thought it might have to do with our client's backup service, Iron Mountain, which backs up 3 of the servers remotely (all three are 2003 servers). One of the servers is backed up every 15 minutes; this was set up by their original IT guy. The lockups appeared to occur when the Microsoft Shadow Copy service and Volume Shadow Copy service were running, which are what the Iron Mountain services use. But now, because of the storage alert, I'm not certain this is the problem (though it's certainly not ruled out).

There are no logs reporting errors in OpenManage, in vSphere, or in VEEAM monitoring (free version).

Dell is looking at the DSET logs but has not found any issues at this time.

All drivers, firmware, and software are up to date.

Any ideas of what may be going on and what else to look at?

SOLUTION:

This was resolved some time ago, but in case anyone else has similar issues: the RAID controller was bad and had to be replaced. Dell would not replace it until they were certain it was the cause, and we could not get logs to prove it. We were never able to get logs verifying that the controller was the problem, but with some pressure we got Dell to send out a new controller anyway. The host has never locked up again since it was replaced.

1 Reply
Goatiiee
Contributor

Replaced the RAID controller.
