VMware Cloud Community
mylesw
Contributor

Advice needed: Resolving ESXi 3.5 Hypervisor failure

Over the past few weeks we've had a couple of incidents where one of our VM host machines running VMware ESXi 3.5 locked up. In each case, 2 of the 6 guest VMs on the machine went offline, and the other 4 either continued running or ran with errors. Both times, rebooting the hypervisor resolved the issue, but since this has happened twice in about 10 days, I'm looking for suggestions on how to debug and correct it.

I cannot use ESXi 4 because this host machine is 32-bit only, so I'm staying with 3.5. I believe the machine is not running the latest updates, but I was unable to connect to it with the Virtual Infrastructure Update software. When I attempt to connect to this host, I get the error "The VMWare Infrastructure Update service could not manage one or more of the selected hosts". Strangely, I have a second VM host with the same hardware, and I can connect to it and manage it with no issues.

When the system failed, I was still able to SSH into it. I tried to look at the file system and got errors when running ls on it. It looks like the file system might be corrupted, but after a reboot it all works fine. I am able to back up the VMs to an external machine, so I'm not worried about their health. It's the hypervisor itself that I think needs some TLC.
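To capture exactly which directories fail to list the next time this happens, rather than checking by hand with ls, something like the following could help. It is only a minimal sketch: the function name `find_unreadable_dirs` is made up for illustration, and it assumes a Python interpreter is available wherever the file system can be reached, which may well not be the case on the ESXi 3.5 console itself.

```python
import os


def find_unreadable_dirs(root):
    """Walk `root` and return (path, reason) for each directory that cannot be listed."""
    failures = []

    def record(err):
        # os.walk passes the OSError raised while listing a directory.
        failures.append((err.filename, err.strerror))

    # The loop body is empty on purpose: only the listing errors are of interest.
    for _dirpath, _dirnames, _filenames in os.walk(root, onerror=record):
        pass
    return failures
```

Calling `find_unreadable_dirs("/vmfs/volumes")` and printing the result would give a list of failing paths to compare between a healthy state and a locked-up state. The starting path here is just an example.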

Anyway, my specific questions:

1. How can I view log messages about any problems that might be occurring, so I have a better idea of how to resolve this?

2. How can I apply an update to this machine without using the Virtual Infrastructure Update software? Can this be done with a CD or something like that?

3. If the problem is a hard disk failure, is there a way to manually run the equivalent of fsck on the local drives attached to this server?

I have a replacement hardware box arriving in the next few days, so I can migrate all VMs to it and run it as a replacement unit. That alleviates the urgency, but once it arrives and I've moved this box out of production, I'd like to bring it back online, and I want to be sure I have a handle on this problem first.

Thanks in advance for any suggestions.

Myles

1 Reply
mylesw
Contributor

To add to this, the affected ESXi host reports its version as:

VMWare ESX Server 3i, 3.5.0, 169697

Hope that helps.

M
