jontackabury
Contributor

All Virtual Machines become unresponsive

Every 2-3 weeks, all of the VMs on our ESXi server become unresponsive. I can connect to the host using the vSphere Client, but I can't perform any actions: I can't reboot the host, check the hardware status, or anything else. None of the VMs can be accessed either. The only way to get the server operational again is to push the power button and bring it all down. Once it's rebooted there aren't any errors anywhere, and everything looks normal. This has happened on 2 of our ESXi hosts; both are Dell PowerEdge R310 servers that are only 5 months old. Any thoughts on what the issue could be, or how to troubleshoot it?

DSTAVERT
Immortal

I would first set a log location so that you retain access to log information after a restart. The logs will contain information that may indicate where the problem lies. Have a look at the following:

http://kb.vmware.com/kb/1016621
http://kb.vmware.com/kb/1019102
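
For example, here's a rough sketch of redirecting the logs to a datastore from Tech Support Mode; the advanced option name below is the ESXi 4.x one and the datastore path is just a placeholder, so verify both against the KBs above:

    # Point local logging at persistent storage on a VMFS datastore
    # (placeholder path; substitute one of your own datastores)
    esxcfg-advcfg -s "/vmfs/volumes/datastore1/logs/messages" /Syslog/Local/DatastorePath
    # Confirm the new setting took effect
    esxcfg-advcfg -g /Syslog/Local/DatastorePath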

Here is a KB troubleshooting article:

http://kb.vmware.com/kb/1007818

What model of disk controller are you using? How many drives and what RAID level?

-- David -- VMware Communities Moderator
jontackabury
Contributor

I'll set the log location right now so we're ready for the next hang-up. We're using 2x 1TB SATA 7200rpm drives in RAID 1 on a Dell SAS 6iR internal RAID adapter. We do a nightly snapshot of all the VMs on each host to a NAS over NFS. All of the hang-ups we have experienced have happened during this snapshot window, except for one. I hope this information helps. Thanks! :)
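
In case it's relevant, here's how I can double-check the NFS mount from Tech Support Mode (a sketch; esxcfg-nas is the ESXi 4.x command for listing NFS datastores):

    # List NFS datastores mounted on this host (label, server, share, mount state)
    esxcfg-nas -l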

a_p_
Leadership

With the SATA disks and the RAID controller (SAS 6iR) you are using, this could well be an "overload" causing the management agents to stop responding.

What I would do next time you see this happen is log in to the console and run /sbin/services.sh restart.

To access the console, you need to log on in Tech Support Mode. Depending on your version of ESXi, see the corresponding KB article:

Tech Support Mode for Emergency Support (1003677) or Using Tech Support Mode in ESXi 4.1 (1017910).
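
Roughly, it looks like this (a sketch based on those KBs; the exact steps depend on your build):

    # ESXi 4.1: enable Local Tech Support Mode first (DCUI > Troubleshooting Options),
    # then press Alt+F1 at the console and log in as root.
    # Pre-4.1: press Alt+F1, type "unsupported" (it does not echo), then the root password.
    # Once at the shell, restart all management agents:
    /sbin/services.sh restart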

André

PS: You can also try to restart the Management Agents from the Direct Console User Interface (DCUI).

DSTAVERT
Immortal

You can check the rotated vmware-x.log files (anything other than vmware.log, which is still active) for the individual VMs, as well as Event Viewer in Windows or /var/log in Linux, to see if there are any indicators.
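
For example, from Tech Support Mode (a sketch; on ESXi 4.x the host messages land in /var/log/messages, but paths vary by version):

    # Find the rotated per-VM logs (vmware-1.log, vmware-2.log, ...) on the datastores
    find /vmfs/volumes -name "vmware-*.log"
    # Scan the host log for storage errors, resets, or aborts
    grep -iE "scsi|reset|abort" /var/log/messages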

How many simultaneous backup jobs?

Your controller does not have write-caching capabilities, since there is no battery-backed cache module available for it. Performance is greatly increased with a controller capable of write caching. Make sure there are no big internal VM processes going on at the same time as the backups, and make sure both drives are functioning.

-- David -- VMware Communities Moderator
jontackabury
Contributor

This error is happening on 2 different hosts, so I don't think it's a failed drive. There is just the one backup per host per night. I know the controller doesn't have write caching, and I'm sure it isn't the fastest configuration around, but a slow controller shouldn't cause VMware to hang so badly that I have to press the power button on the server. :(

DSTAVERT
Immortal

As André indicates, it is possible to overload a controller. In a virtual environment, a write-caching-capable controller is far more important than on a standalone server. Capture and examine the logs for things like errors or resets.

Make sure that all the Dell firmware is up to date as it applies to ESXi, and make sure that ESXi itself is up to date.

-- David -- VMware Communities Moderator
keithlammersBFS
Enthusiast

Hi all,

I work with Jon and we've both been stumped by this issue, but I figured I'd add a couple of notes:

We actually have two ESXi hosts with the same configuration Jon mentioned above.

ESXi Host 1 has 5 VMs running on it, and has never had any issues.

ESXi Host 2 has 2 VMs running on it, and it's the one with the issues.

Also note that the VMs on Host 1 have much higher utilization than the VMs on Host 2. I should also mention that although most of the time it's happened during the backups, it did occur once in the middle of the afternoon, when the servers weren't doing any outrageous amount of disk work.

The funny thing is, after we moved the 2 VMs from Host 2 to Host 1, Host 1 locked up just the same as Host 2 had been doing for a few months.

Is that just a coincidence? Or is it possible that there's something messed up with one of the VMs from Host 2 that is throwing a wrench into ESXi?

The advice is much appreciated, guys. Thanks!

keithlammersBFS
Enthusiast

Another thing we noted is that this error shows up occasionally in the Event log for ESXi:

Lost access to volume {GUID} (datastore1) due to connectivity issues.

That error showed up around 5 hours ago; about 5 seconds later it said it had reconnected, and all seems to be OK.
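
In case anyone wants to check their own hosts for the same event, it can be pulled out of the host log from Tech Support Mode (a sketch; on our ESXi 4.x builds the messages end up in /var/log/messages):

    # List every occurrence of the connectivity event, with timestamps
    grep -i "lost access to volume" /var/log/messages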

LucasAlbers
Expert

We had two ESXi hosts that were losing disks; the hosts would just stop seeing them.

They were Dell R710s.

A BIOS update resolved this. I would check whether a BIOS update exists for this issue on your model, and at the least ping Dell to see if this hardware also exhibits the problem.

Generious
Enthusiast

First of all, post up the logs located in /var/log/; the one to post would be vmkernel.

What load-balancing policy are you using on the storage end? RR/MRU/Fixed?

What version of ESXi do you have installed? 3.5/4.0/4.1/4.1 Update 1? (See the commands sketched below for gathering these.)
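
For example (a sketch assuming ESXi 4.x; the esxcli namespaces changed in later releases):

    # Show the installed ESXi version and build
    vmware -v
    # Show each storage device's path selection policy (Fixed / MRU / RR)
    esxcli nmp device list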

keithlammersBFS
Enthusiast

There actually is a BMC and BIOS update available for our server. We're going to apply those, and we're also going to install a PERC H700 RAID controller and some extra disks, and set them up in a RAID 10 array.

I'll post an update again in a month or so once we've verified that the issue is resolved.

Thanks for all of your advice and suggestions!

keithlammersBFS (Accepted Solution)
Enthusiast

Well, this is a really late follow-up, but what ended up solving the issue was replacing our PERC 6i RAID cards with PERC H700 (w/ BBU) RAID cards. Since we did that (over 3 years ago now!) we haven't had these issues at all.
