blubdidiblub
Contributor

Intermittent Lockups, but no crashes

Dear VMware Community, I am at the end of my rope here and hope for some advice:

I installed a Debian 6 system with VMware Server and had a Windows 2008 R2 and an Ubuntu 10.04 LTS server running on it. At random intervals it locked up for 3 to 4 minutes and then continued. It did not crash, but all VMs and VMware Server itself were unavailable, and then they were back again as if nothing had happened.

Suspecting the hardware RAID controller, I had a new and different model, an Adaptec 5405, installed and configured. However, the problem of VMware taking a break from time to time persisted.

Next I wiped the whole system and installed ESXi 4.1, which went without a problem. I copied the images onto the ESXi server and the same problem occurred again: it hangs for several minutes before everything continues. It hangs sometimes once every 8 hours, sometimes a dozen times an hour. No correlation with particular user activities or cron jobs is apparent. The performance graphs for CPU and RAM simply show gaps after the server carries on.

On my friend's ESXi 4.1 server the same images run without any trouble whatsoever for days. Before Debian 6/VMware Server and ESXi, Windows 2008 R2 Server was installed directly on this machine and ran without any trouble.

The server itself is a quad-core Xeon with 8GB RAM; all of those components were tested and are fine. There is more than enough processing power available. The chipset is an Intel 5/3400 series.

Dave_Mishchenko
Immortal

Welcome to the VMware Communities forums. Given that the problem has persisted after your change to ESXi, it would seem to point to some sort of hardware issue. When you were running VMware Server were you ever able to access the host during one of these outages? How about with ESXi now? Are you able to access the server when the VMs appear to be locked up?

I would take a look at the log files. The VMkernel log is /var/log/messages. Does it record a gap as well? Note that the log files use UTC for timestamps.
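
A minimal sketch for finding such gaps, assuming each log line starts with a classic syslog-style `Mon DD HH:MM:SS` timestamp (check the format your host actually writes and adjust the parsing). The function names and the 120-second threshold are illustrative and not part of any VMware tool; the year has to be supplied by hand because this timestamp format does not carry one.

```python
from datetime import datetime, timedelta


def parse_syslog_time(line, year):
    """Parse a 'Mon DD HH:MM:SS' prefix; return None if the line has none."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def find_gaps(lines, year, threshold=timedelta(seconds=120)):
    """Return (last_seen, next_seen) pairs where the log was silent too long."""
    gaps = []
    previous = None
    for line in lines:
        stamp = parse_syslog_time(line, year)
        if stamp is None:
            continue  # skip lines without a recognisable timestamp
        if previous is not None and stamp - previous > threshold:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps
```

Run it over the lines of a copy of the log, passing the year the log was written, and compare the reported ranges (in UTC) with the times the VMs were unreachable.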




Dave

VMware Communities User Moderator

Now available - vSphere Quick Start Guide

Do you have a system or PCI card working with VMDirectPath? Submit your specs to the Unofficial VMDirectPath HCL.

blubdidiblub
Contributor

Thanks for your answer.

I also suspect a hardware issue. However, memtest came up clean, the hard drives report their SMART status as OK, and the RAID controller was changed.

During those outages it is impossible to connect to either a running VM or to VMware Server or ESXi itself. A few minutes later VMware Server/ESXi and all VMs run as if nothing had happened.
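
To pin down exactly when the host drops off the network, a small probe run from a separate machine can log every change between reachable and unreachable. This is only a sketch: the host name in the usage note is a placeholder, and port 443 assumes the host's HTTPS management interface is listening there.

```python
import socket
import time
from datetime import datetime, timezone


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def watch(host, port, interval=5.0, checks=None):
    """Probe repeatedly; print a UTC-stamped line whenever the state flips.

    checks=None means run until interrupted.
    """
    last_state = None
    count = 0
    while checks is None or count < checks:
        state = is_reachable(host, port)
        if state != last_state:
            stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
            print(f"{stamp} {host}:{port} {'up' if state else 'DOWN'}")
            last_state = state
        count += 1
        if checks is None or count < checks:
            time.sleep(interval)
```

For example, `watch("esxi-host.example", 443)` prints one line per transition. Because the stamps are in UTC, they can be compared directly with the host's log files.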

The log files do not record anything during that time, nor anything suspicious within 10 minutes either side of the outages. The performance graphs show no recording of the CPU load or any other parameter during this time.

It is just as if the entire server takes a coffee break and then comes back and acts as if nothing ever happened.

I will try to upload the log files here anyway.

Dave_Mishchenko
Immortal

It would be interesting to see the logs, both to check whether there is a "gap" and to see whether the log simply carries on afterwards as if nothing happened.

Do you have the host set to sync with an NTP server? It'll help ensure the times in the log are accurate.




Dave

blubdidiblub
Contributor

OK, the problem is solved. Here is the deal: the server has an Adaptec 5405 RAID controller running RAID 1 with two 300GB WD 3000HLFS Velociraptor drives. Those are apparently buggy in RAID mode and cause a data jam from time to time under higher I/O activity. When ESXi has to wait for data to be written while the disks are busy with themselves, ESXi simply freezes, and it unfreezes once the data has finally been written and the show can go on. Removing the two Velociraptors and installing two new SAS HDDs fixed the problem. Crazy.
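
For anyone chasing a similar stall: one rough way to see it from the storage side is to time small synchronous writes and flag the slow ones. The sketch below assumes a general-purpose OS with Python and the suspect array mounted (for example a Linux install on the same hardware), not the ESXi console. The function names, the 4 KiB payload and the 1-second threshold are arbitrary choices for illustration.

```python
import os
import time


def timed_sync_write(path, payload=b"x" * 4096):
    """Append payload to path, force it to disk, return the elapsed seconds."""
    start = time.monotonic()
    with open(path, "ab") as handle:
        handle.write(payload)
        handle.flush()
        os.fsync(handle.fileno())  # block until the OS reports the data written
    return time.monotonic() - start


def find_stalls(latencies, threshold=1.0):
    """Return (index, seconds) for every sample slower than threshold."""
    return [(i, s) for i, s in enumerate(latencies) if s > threshold]
```

Calling `timed_sync_write` in a loop against a file on the array and passing the collected latencies to `find_stalls` should make multi-second write delays stand out, if the disks behave as described above.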
