Dell R710 servers
Dell PERC 6/i RAID controllers
3 × 500 GB hard drives in a RAID 5 configuration
2 datastores.
VMs: 3 Red Hat Linux 5.5 servers running database-intensive applications with high I/O.
Symptom: the servers become unresponsive or very slow over time. Rebooting individual VMs does not help, but a reboot of the entire ESXi host provides immediate relief, and then the cycle starts all over again. I can only run the servers about a week before I need to reboot again.
Disk latency can average in the 100 to 300 ms range before a reboot. After a reboot the latency averages in the 5 to 9 ms range and slowly grows as the days go on.
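If it helps to put numbers on that growth rather than eyeballing it, the daily latency averages can be trended with a quick least-squares fit to estimate when the host will hit the unusable range again. This is a generic sketch, assuming you are already recording daily averages (e.g. from esxtop or the vCenter performance charts); the sample numbers below are illustrative, not from the affected host:

```python
# Sketch: given daily average device-latency samples (ms), fit a linear
# trend and estimate when latency will cross a "reboot needed" threshold.
# Sample data below is illustrative only.

def latency_trend(samples):
    """samples: list of (day_number, avg_latency_ms).
    Returns (slope, intercept) from an ordinary least-squares fit."""
    n = len(samples)
    sx = sum(d for d, _ in samples)
    sy = sum(l for _, l in samples)
    sxx = sum(d * d for d, _ in samples)
    sxy = sum(d * l for d, l in samples)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

def days_until(threshold_ms, slope, intercept):
    """Day number at which the fitted line reaches threshold_ms."""
    return (threshold_ms - intercept) / slope

if __name__ == "__main__":
    # e.g. latency climbing from single digits toward the 100+ ms zone
    samples = [(1, 7), (2, 12), (3, 20), (4, 31), (5, 45)]
    slope, intercept = latency_trend(samples)
    print("growth: %.1f ms/day" % slope)
    print("hits 100 ms around day %.1f" % days_until(100, slope, intercept))
```

A trend line like this is also useful evidence when arguing for a hardware refresh before the scheduled replacement date.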
CPU and memory usage look normal to me, consistently flat even when the servers seem to be struggling.
Any ideas what would cause this or how to fix it?
I have checked the batteries within Dell OpenManage and they look good with no errors. I did order new batteries just in case this is the issue. My guess is that my Dell PERC 6/i controller is not up to the task of the 3 VMs I have running on the same RAID array. New hardware will be costly. Would I gain any benefit from creating another RAID array on the same server with 3 additional drives and moving one of the VMs to the new array/datastore?
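A rough way to answer the second-array question is the classic RAID write-penalty arithmetic. The per-spindle IOPS figure and the 50/50 read/write mix below are assumptions, not measurements from this host; a sketch:

```python
# Back-of-envelope random-IOPS estimate. Assumptions (not measured):
# ~75 IOPS per 7.2k SATA spindle, RAID 5 write penalty = 4,
# RAID 10 write penalty = 2, reads carry no penalty.

def effective_iops(spindles, iops_per_spindle, write_penalty, write_ratio):
    """Approximate host-visible random IOPS for a given array layout."""
    raw = spindles * iops_per_spindle
    return raw / (write_ratio * write_penalty + (1 - write_ratio))

SATA = 75     # IOPS per 7.2k SATA spindle (assumed)
WRITES = 0.5  # assumed 50/50 read/write database mix

current    = effective_iops(3, SATA, 4, WRITES)         # one 3-disk RAID 5
two_arrays = 2 * effective_iops(3, SATA, 4, WRITES)     # add a second RAID 5
raid10     = effective_iops(6, SATA, 2, WRITES)         # same 6 disks, RAID 10

print("one 3-disk RAID 5 : %.0f IOPS" % current)
print("two 3-disk RAID 5s: %.0f IOPS" % two_arrays)
print("6-disk RAID 10    : %.0f IOPS" % raid10)
```

On these assumptions, a second RAID 5 array roughly doubles aggregate throughput, but putting the same six disks into a single RAID 10 would do noticeably better still, because the write penalty drops from 4 to 2.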
The PERC 6/i is not one of the high-speed controllers, so it might well be the bottleneck. Do you have any traces (errors or warnings) in your logs?
Maybe two RAID 1 arrays or a RAID 10 would give better performance. Are you using SATA or SAS disks?
SATA disks. Looking in the Dell OpenManage logs, I don't see any errors or warnings. The PERC firmware is currently 6.2.0-0013, and the newest version is 6.3.3-0002. I'm not saying I shouldn't update the firmware, but I think my problems are bigger than that.

The PERC battery learn cycle has always made these servers unusable for a few hours. We kick off the cycle manually every 90 days during planned outages to avoid issues during the automatic cycle. I have many other servers that show no issues during the learn cycle. These servers are 5 years old, and my guess is that we've always had issues but they have become more pronounced in the last few weeks. They aren't scheduled to be replaced for another year.
I have been in a similar situation, and the only solution was to move to faster SAS drives (either 10k or 15k), a faster RAID controller (although not always necessary), and RAID 10 to get the I/O I needed. In your case the drives are not built for high-I/O, low-latency performance, and RAID 5 is certainly not helping things either.
I would look into new drives, perhaps four 600 GB SAS drives in RAID 10. I can't see any solution with the hardware you have; it's simply not made for the performance you want/need.