cdub
Contributor

ESXi 5.1 VMs slowly degrade with disk latency issues until unusable. How do I troubleshoot/resolve?


Dell R710 servers

Dell PERC 6/i RAID controllers

3 x 500 GB hard drives in a RAID 5 configuration

2 datastores

VMs: 3 Red Hat Linux 5.5 servers running database-intensive applications with high I/O.

Symptom: Servers become unresponsive or very slow over time. Individual VM reboots do not help. A reboot of the entire ESXi server provides immediate relief and the process starts all over again. I can only run the servers about a week before I need to reboot again.

Disk latency can average in the 100 to 300 range before a reboot. After the reboot the latency averages in the 5 to 9 range and slowly grows as the days go on.

CPU and memory usage look normal to me. Consistently flat even when the servers seem to be struggling.

Any ideas what would cause this or how to fix? 


Accepted Solutions
CoolRam
Expert

Please run esxtop on the server and press d to see the disk latency.

Follow the "ESXTOP" page on Yellow Bricks to troubleshoot latency.
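As a rough aid for reading the esxtop disk view, here is a minimal Python sketch that flags which latency counter looks high. The thresholds (DAVG 25 ms, KAVG 2 ms, GAVG 25 ms) are commonly cited rules of thumb and are an assumption here, not values taken from this thread; the names `THRESHOLDS_MS` and `classify_latency` are hypothetical. It relies on GAVG (guest latency) being the sum of DAVG (device) and KAVG (kernel).

```python
# Rule-of-thumb thresholds in milliseconds (assumed, not official limits).
THRESHOLDS_MS = {"DAVG": 25.0, "KAVG": 2.0, "GAVG": 25.0}


def classify_latency(davg_ms, kavg_ms):
    """Return (gavg_ms, list of counters above their threshold).

    davg_ms: device latency (array/controller/disks), as read from esxtop.
    kavg_ms: kernel latency (time spent in the VMkernel, e.g. queuing).
    """
    gavg_ms = davg_ms + kavg_ms  # latency as seen by the guest
    values = {"DAVG": davg_ms, "KAVG": kavg_ms, "GAVG": gavg_ms}
    high = [name for name in ("DAVG", "KAVG", "GAVG")
            if values[name] > THRESHOLDS_MS[name]]
    return gavg_ms, high


if __name__ == "__main__":
    # Illustrative inputs only: a high device latency with a low kernel latency.
    gavg, high = classify_latency(200.0, 0.5)
    print(f"GAVG={gavg:.1f} ms, above threshold: {high}")
```

If DAVG is high while KAVG stays low, the delay is most likely below the hypervisor (controller, cache, or disks); a high KAVG would instead point at queuing inside the host.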

If you find any answer useful, please mark the answer as correct or helpful.

7 Replies
cykVM
Expert

Have you checked the status of the PERC 6/i's battery (BBU)? If it is nearly dead, the controller may be switching off the write cache after some runtime, and that will get VERY slow.

cdub
Contributor

I have checked the batteries within Dell OpenManage and they look good with no errors. I did order new batteries just in case this is the issue. My guess is that my Dell PERC 6/i controller is not up to the task of running the 3 VMs I have on the same RAID array. New hardware will be costly. Would I gain any benefit from creating another RAID array on the same server with 3 additional drives and moving one of the VMs to the new array/datastore?

cykVM
Expert

The Perc 6 is not one of the high speed controllers, so that might well be the bottleneck. Do you have any traces (errors or warnings) in your logs?

Maybe two RAID 1 arrays or a single RAID 10 array would give better performance. Are you using SATA or SAS disks?

cdub
Contributor

SATA disks. Looking in the Dell OpenManage logs, I don't see any errors or warnings. The PERC firmware is currently 6.2.0-0013 and OpenManage says the newest version is 6.3.3-0002. I'm not denying that I should update the firmware, but I think my problems are bigger than that. The PERC battery learning cycle has always made these servers unusable for a few hours. We have to kick off the cycle manually every 90 days during planned outages to avoid issues during the automatic cycle. I have many other servers that show no issues during the learning cycle. The servers are 5 years old, and my guess is that we've always had issues but they have become more pronounced in the last few weeks. They are not scheduled to be replaced for another year.

cykVM
Expert

Also check your host's logs.

For high I/O, SAS disks are the minimum I would recommend.

ashleymilne
Enthusiast

I have been in a similar situation, and the only solution was to move to faster SAS drives (either 10k or 15k), a faster RAID controller (although that is not always necessary), and RAID 10 to get the I/O I needed. In your case the drives are not built for high-I/O, low-latency performance, and RAID 5 is certainly not helping things either.

I would look into new drives, perhaps four 600GB SAS in RAID 10. I can't see any solution with the hardware you have; it's simply not made for the performance you want/need.
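To put rough numbers on the RAID 5 versus RAID 10 point, here is a minimal Python sketch of the usual write-penalty estimate. Everything numeric in it is an assumption for illustration, not a measurement from this thread: about 75 IOPS per 7.2k SATA disk, about 175 IOPS per 15k SAS disk, a 50% write mix, and write penalties of 4 for RAID 5 and 2 for RAID 10. The function name `effective_iops` is hypothetical, and controller cache effects are ignored.

```python
# Back-end I/Os generated per front-end write (common rule-of-thumb values).
WRITE_PENALTY = {"raid5": 4, "raid10": 2}


def effective_iops(disks, iops_per_disk, raid_level, write_fraction):
    """Estimate front-end IOPS an array can sustain for a given write mix.

    Each read costs one back-end I/O; each write costs WRITE_PENALTY I/Os.
    """
    raw = disks * iops_per_disk
    penalty = WRITE_PENALTY[raid_level]
    cost_per_io = (1.0 - write_fraction) + write_fraction * penalty
    return raw / cost_per_io


if __name__ == "__main__":
    # Assumed per-disk figures and write mix; adjust to your own workload.
    current = effective_iops(3, 75, "raid5", 0.5)     # 3 x SATA, RAID 5
    proposed = effective_iops(4, 175, "raid10", 0.5)  # 4 x 15k SAS, RAID 10
    print(f"3 x SATA RAID 5:  ~{current:.0f} IOPS")
    print(f"4 x SAS RAID 10:  ~{proposed:.0f} IOPS")
```

Under these assumptions the existing array lands at roughly 90 IOPS shared across three database VMs, while the proposed four-disk SAS RAID 10 set comes out several times higher. The exact figures matter less than the ratio.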