This year we migrated a system to new hardware and software. The old system was ESXi 5.5 on a Lenovo x3550 M5 with 64GB RAM, one SSD datastore (RAID10) and one HDD datastore (RAID5). All the production systems were on a single VM running Windows 2008 R2.
The new system is ESXi 6.7 on a Fujitsu RX2530 M4 with 64GB RAM and a single all-SSD datastore (RAID10). The production server is now Windows Server 2016. Both servers use MegaRAID-based RAID controllers: the Fujitsu has a PRAID EP540i and the Lenovo has a ServeRAID M5201.
The reason for this post is that we have been experiencing some issues since the upgrade that we were not expecting.
Firstly, the strangest issue. For a couple of years we periodically took snapshots of the main VM during working hours and never had any issues. Since we moved to the Fujitsu, users complain of performance issues during and after snapshots, and on more than one occasion the whole Windows VM has frozen and we have had to reset it (no errors shown on the ESXi side). We now avoid taking any snapshots during working hours.
Another issue is that when running backups (Veeam) the system becomes quite unresponsive at times, so we now avoid any backups within working hours. Previously we didn't see this issue. We are using CBT backups, but often, and without reporting any error, Veeam insists on reading practically the entire VM, which is over 2TB in size. Veeam reports backup throughput of about 300-400MB/sec with the source as the bottleneck which, while not slow, doesn't seem particularly amazing for a RAID 10 array of 6 SSDs. I'm not worried about the 300-400MB/sec speed as such, just mentioning it in case it seems unusual to anyone else.
Lastly, and most importantly, it seems that whenever the Windows VM uses any page file, users experience general lag in the system. We have spent the last week ensuring that the system fits within physical memory and have thereby reduced the impact substantially, but Windows still seems to like using some page file even when there is a lot of free RAM, and in any case, with SSD RAID 10 we'd hope that any page file usage would be pretty speedy.
Worth mentioning: we are using the paravirtualized SCSI controller for the VM disks. Also worth noting: this VM is currently the only active VM on the Fujitsu, and there is sufficient physical memory for it to run without using swap.
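For a rough sense of scale on the backup figures above, here is a small sketch of how long a full read of the VM takes at the throughput Veeam reports. The 2 TB size is an assumption (the VM is only described as "over 2TB"), and `full_read_hours` is just an illustrative helper name.

```python
# Rough estimate of how long a full read of the VM takes at a given throughput.
# Assumption: VM size taken as exactly 2 TB (decimal units); the real VM is larger.

def full_read_hours(size_tb: float, throughput_mb_s: float) -> float:
    """Hours needed to read size_tb terabytes at throughput_mb_s megabytes/second."""
    size_mb = size_tb * 1_000_000  # 1 TB = 1,000,000 MB in decimal units
    return size_mb / throughput_mb_s / 3600


for rate in (300, 400):
    print(f"{rate} MB/s -> {full_read_hours(2.0, rate):.1f} h")
# 300 MB/s -> 1.9 h
# 400 MB/s -> 1.4 h
```

So each time CBT falls back to reading nearly the whole VM, the datastore is under sustained read load for somewhere around an hour and a half or more, which fits with users noticing it during working hours.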
We haven't done any specific tuning of the RAID controller, as I didn't find anything specific to MegaRAID controllers when searching for info, so it's using the default settings for a RAID10 array over 6 drives.
So basically I'm wondering if anyone has any thoughts, or has experienced and/or fixed any similar issues. It's as disappointing to be having issues on newer, faster hardware that we did not experience on the old hardware as it is to have the lagging issue on the Windows VM after investing in an all-SSD solution. Any input gratefully received.
It might be good to check how your RAID is configured; the default settings might not be quite right for an SSD RAID.
General recommendations for an SSD RAID would be:
RAID type: RAID 1 for 2x SSDs; RAID 5 for 3x or more SSDs; or RAID 10 for 4x or more SSDs (in pairs)
Disk cache policy: Default (enabled by default)
Write policy: Write Through
Read policy: No read ahead.
Stripe Size: 64K
Disk Cache Policy: when enabled, allows writing to the cache of the disk prior to the medium
– For virtual disks having SATA disks underneath, this policy is ENABLED by default;
– For virtual disks having SAS disks underneath, this policy is DISABLED by default.
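As a minimal sketch of how the checklist above could be compared against what the controller is actually set to: the `RECOMMENDED` values are copied from the list above, while the key names, the `settings_to_review` helper, and the `current` values are made up for illustration and would need to be filled in from the controller's own management tool.

```python
# Recommended values taken from the list above; the key names are illustrative.
RECOMMENDED = {
    "disk_cache_policy": "default",
    "write_policy": "write through",
    "read_policy": "no read ahead",
    "stripe_size_kb": 64,
}


def settings_to_review(current: dict) -> dict:
    """Return {setting: (current_value, recommended_value)} for each mismatch."""
    return {
        key: (current.get(key), wanted)
        for key, wanted in RECOMMENDED.items()
        if current.get(key) != wanted
    }


# Hypothetical current configuration, NOT read from a real controller.
current = {
    "disk_cache_policy": "default",
    "write_policy": "write through",
    "read_policy": "read ahead",
    "stripe_size_kb": 256,
}

for name, (have, want) in settings_to_review(current).items():
    print(f"{name}: currently {have!r}, recommended {want!r}")
```

Settings missing from `current` show up as `None`, so an incomplete inventory is flagged rather than silently passed.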
Thanks for the reply. I had a check through and we have read ahead enabled, although I can't imagine this would have a drastic effect on performance. We tried changing this, and we also tried setting write back cache (despite many people saying this will hurt performance). Strangely, the VM would hang while booting after either of these changes. These settings should be completely invisible to ESXi and the VM; they should only notice a change in performance. Very strange. We didn't have a long window in which to do much testing, and we were mostly just worried about getting the VM to start at all!
It feels like possibly a hardware issue with the RAID controller, but on the other hand ESX doesn't complain about access to the disk which I would expect if there were a hardware fault. We are going to open a ticket with VMware, let's see how we get on :S