A simple file copy on the local C: disk, from one folder to another, on a Windows 2008 R2 virtual machine causes disk latency (DAVG/wr) to go up to 300 ms. If I give the virtual machine another drive, D:, on another LUN, a file copy from D:\ to C:\ pushes DAVG/wr latency as high as 1500 ms. The high write latency is also measurable on other virtual machines on the same LUN.
The same file copying activity on a Windows 7 virtual machine on the same LUN leaves disk latency (DAVG/wr) below 20ms.
Latency is measured with esxtop on the host and iometer inside guests. During my tests there were no other virtual machines running.
Is the disk latency *supposed* to go so high from a simple file copy? It would make me uncomfortable if somebody copying a large file to another folder on the file server could blow up write latency for all other virtual machines too. Or is it not supposed to, and I have misconfigured something?
Our setup:
Server is HP Proliant DL165 G7
SAN is HP MSA P2000i G3
ESXi 5.0 Driver Rollup 2
Server has 4 gigabit ethernet cards. vmnic0 is connected to a switch, here the management and virtual machine networks are connected. vmnic1 is not used. vmnic2 goes to port A0 on the SAN. vmnic3 goes to port A1 on the SAN. Controller B on the MSA has been shut down. The SAN has two LUNs. Using "Manage paths" I have disabled path vmnic2-A1 for Lun0 and vmnic3-A0 for Lun1, so each LUN has a dedicated cat6 cable. Both the Windows 2008R2 and the Windows 7 virtual machines were installed to Lun0.
Thanks, and it is strange that this behavior has not received any attention. It looks like it could cause some odd results on shared storage when many VMs are competing for the same disk system.
It might be that such large IOs are often not possible in reality, since there must really be 32 MB of contiguous free disk space available.
erietveld wrote:
Is the disk latency *supposed* to go so high from a simple file copy?
No, those are really very high numbers. Do you only observe this on the Windows 2008 R2 server and not on any other VMs?
Yes, it happens this way only on the Windows 2008 R2 machine, not on the Windows 7 machine. I have deleted the VMs and reinstalled them from scratch a number of times.
I can also create very high latency in esxtop on Linux virtual machines by writing directly to the disk like so:
dd if=/dev/zero of=/dev/sda bs=1M count=5000
This causes DAVG/wr to go to 300 ms or higher in esxtop. A simple file copy on Linux does not cause high latency.
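For anyone wanting to repeat this without destroying a disk: writing raw to /dev/sda as above wipes the device. A non-destructive variant (my own sketch, the file name is just an example) writes to a scratch file and forces the data to the backing storage with fsync, so it still exercises sequential writes:

```shell
# Non-destructive variant of the dd test above: write 64 MB of zeros to a
# scratch file instead of the raw device, forcing an fsync at the end so
# the writes actually reach the backing storage. /tmp/ddtest.bin is just
# an example path; delete it afterwards.
dd if=/dev/zero of=/tmp/ddtest.bin bs=1M count=64 conv=fsync
rm -f /tmp/ddtest.bin
```

Watching DAVG/wr in esxtop while this runs should show the same effect if the problem is at the host/array level rather than in the guest's partition.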
I am not familiar with the specific SAN you have, but could there be issues with write caching? Does it have a battery-backed cache enabled?
A missing or unconfigured battery-backed cache could cause very slow write times.
The SAN has write caching enabled. "Battery-free cache backup with super capacitors and compact flash"
The console of the SAN shows no warnings or errors. I had the vendor (HP) check for hardware issues. We even replaced the controller. This did not help.
I see no write latency issues when attaching the LUNs to a physical machine (i.e., running something else instead of ESXi).
Copying a file from one LUN to another in VSphere client also shows no issue.
Do you see anything else strange while doing heavy disk activity - like high CPU on the specific guest or on the host?
What kind of network usage do you see on the vmnics?
Which scsi controller type are you using in the Windows 2008 machine?
I haven't noticed anything out of the ordinary, but that doesn't mean nothing is. During the file copy, "explorer.exe" shows 25% CPU usage in the guest, which is unusually high for a copy operation but does not seem to indicate a bottleneck. The Windows 2008 R2 server has only one virtual CPU assigned.
On the host, during the file copy the CPU usage spikes up to about 8% from 1.5% average.
During a file copy from C: to D: (Lun0 to Lun1), the data receive rate on vmnic2 goes to about 50 MB/s and the data transmit rate on vmnic3 goes to about 50 MB/s. There seems to be no other noticeable activity.
On the Windows 2008 R2 virtual machine, in "Device Manager" -> "Storage controllers" the following device is listed:
LSI Adapter, SAS 3000 series, 8-port with 1068
If that is not what you meant with "Which scsi controller type" please tell me how to find that out.
It's a fresh install from DVD, I haven't installed anything or made any changes except configuring the network, and installing iometer.
erietveld wrote:
During a file copy from C: to D: (Lun0 to Lun1), the data receive rate on vmnic2 goes to about 50 MB/s and the data transmit rate on vmnic3 goes to about 50 MB/s. There seems to be no other noticeable activity.
So CPU usage is low, which should rule that out. The 50 MB/s you see: is that really MB (as in megabytes) or is it megabits? If it is MB then it is still acceptable throughput, but not if it is megabits.
On the Windows 2008 R2 virtual machine, in "Device Manager" -> "Storage controllers" the following device is listed:
LSI Adapter, SAS 3000 series, 8-port with 1068
If that is not what you meant with "Which scsi controller type" please tell me how to find that out.
You can see it in the vSphere Client, in the VM's settings: check the SCSI controller type. However, it is almost certainly "LSI Logic SAS", which is good and should not be the issue either.
What is the value of Disk.SchedNumReqOutstanding in the host advanced settings?
Indeed the SCSI controller is LSI Logic SAS.
For the data transmit/receive, the unit listed is KBps (in that capitalization) and the value is above 50000.
Windows 2008R2's file copy dialog reports 45 MB/second transfer speed (in that capitalization).
Since it is a cat6 gigabit link, and the array can easily handle more IOPS, I would have expected it to be capable of twice that, but I am much more concerned about the high latency than the throughput.
The value of Disk.SchedNumReqOutstanding is shown as 32. (I have not changed any advanced setting after installing ESXi Driver Rollup 2)
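As a rough sanity check of how far that is from the wire (my own back-of-the-envelope numbers; I am assuming roughly 118 MB/s of usable payload on a gigabit link after Ethernet/IP/TCP/iSCSI overhead):

```shell
# Convert the esxtop reading (50000 KBps) to MB/s and compare against an
# assumed ~118 MB/s gigabit payload ceiling. Pure arithmetic, no ESXi needed.
awk 'BEGIN {
  mbps = 50000 / 1024            # esxtop KBps -> MB/s
  line = 118                     # approx. usable gigabit payload, MB/s (assumption)
  printf "observed: %.1f MB/s, about %.0f%% of line rate\n", mbps, 100 * mbps / line
}'
```

So the copy is using well under half the link, which supports the view that the network path is not the bottleneck and that latency, not throughput, is the real problem.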
erietveld wrote:
For the data transmit/receive, the unit listed is KBps (in that capitalization) and the value is above 50000.
Windows 2008R2's file copy dialog reports 45 MB/second transfer speed (in that capitalization).
Since it is a cat6 gigabit link, and the array can easily handle more IOPS, I would have expected it to be capable of twice that, but I am much more concerned about the high latency than the throughput.
That is decent throughput, but as you say the latency values are way too high and will likely hurt performance a lot.
Could you post some esxtop screenshots while doing file copies? The d, u, and v screens.
As mentioned in another thread, latency is just the product of queue depth and transaction time against the number of drives. Can you provide some info about the array? For whatever reason, this 2k8R2 VM is just saturating its controller queue.
Basic file handling does, though, seem to be a problem with 2k8 and R2 - only last week I came across a situation where Win2k8 (not R2 in that case) would aggressively cache file data to the exclusion of quite literally everything else (this is demonstrable on both physical and virtual installs).
The top one is the show disk-statistics command on the array;
the other three are esxtop in d, u, and v mode, respectively.
If you need more sampling points, please let me know. These screenshots were taken towards the end of the file copy (in the last minute), but the latency was consistently above 300 ms, and often as high as 1000 ms. The three esxtops are not exactly in sync, but they are within one second of each other.
@J1mbo: unfortunately, I am not experienced enough to know what I can tell you about the array that would be interesting for you to know. Can you be more specific to what I should tell you about the array?
It's an HP MSA P2000i G3, with 12 15k rpm 600 GB SAS drives:
8 are Hitachi HUS156060VLS600
4 are Seagate ST3600057SS
Thanks for the esxtop screens.
Just a couple of questions: is vmhba36 the software iSCSI adapter, I guess?
Could you also provide a "n" esxtop screenshot while doing file copy?
As for the SAN, do you know how the two datastores are physically configured? That is, how many disks and what RAID level?
Yes, the vmhba36 is the software iscsi adapter. Attached is esxtop n screen during file copy.
Lun0 and Lun1 are both 6-disk RAID6 arrays. Earlier, I tested with a 12-disk RAID0 array and still got latency above 300 ms, but I cannot reproduce that currently as I don't have free disks. The SAN vendor (HP) has walked me through a long troubleshooting procedure and insists that the problem is not in the SAN array or its current configuration.
Copying on the same partition gives latency levels above 300ms and same transfer speed (45 MB/second)
Both LUNs perform the same.
There are currently no other hosts connected to the SAN.
Everything looks quite good on the network view too. I see that both iSCSI vmnics are used and none of them have any real load either. No dropped packets.
RAID5 and RAID6 do have some write penalty, but nothing like what you are seeing. Are you sure the cache settings are OK? I am sure you have verified this, but could there be something incorrect with the write-through/write-back settings?
Could you, by the way, test with some disk performance tool, like IOmeter, and try doing only reads or only writes and see what the result is?
As far as I can verify, the caching settings are OK. The interface tells me caching is enabled, there are no warnings or errors, and an HP support engineer has assured me that the caching settings are correct. Also, connecting the SAN to another server (a Linux host instead of ESXi), we can write with 100 MB/s throughput and low latency. The latency problem also does not reproduce when we copy a file from one LUN to another in the vSphere client, nor when we copy a file inside a Windows 7 virtual machine.
How should I configure IOMeter to do a proper test?
When I configure it with one worker doing 16K writes (0% read, 0% random) and allow it 8 outstanding IOs, it writes with 45 MB/s throughput and 3 ms latency on the Windows 2008 R2 server. The same figures are reported by iometer and visible in esxtop. If I allow 32 outstanding IOs, the reported throughput is 55 MB/s and latency goes up to 9 ms; again iometer and esxtop agree. Doing only 16K reads with 32 outstanding IOs, it reads at 110 MB/s with 4 ms DAVG/rd.
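For what it's worth, those iometer figures are self-consistent with Little's law (average latency ≈ outstanding IOs / IOPS), which matches the earlier remark that latency is the product of queue depth and transaction time. A quick check using the numbers reported above:

```shell
# Little's law sanity check: latency = outstanding IOs / IOPS.
# Figures taken from the iometer runs above (16 KB sequential writes).
awk 'BEGIN {
  iops1 = 45 * 1024 / 16         # 45 MB/s at 16 KB per IO -> 2880 IOPS
  iops2 = 55 * 1024 / 16         # 55 MB/s at 16 KB per IO -> 3520 IOPS
  printf "8 OIO:  %.1f ms (iometer reported ~3 ms)\n", 1000 *  8 / iops1
  printf "32 OIO: %.1f ms (iometer reported ~9 ms)\n", 1000 * 32 / iops2
}'
```

Since small 16K writes behave entirely normally, whatever the 2008 R2 copy engine does during a file copy must differ in IO size or pattern from this workload.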
I wonder if the Windows 2008 R2 server is using some really large IOs? That could cause these extreme latencies. We saw only about 50 commands per second while around 50 MB/s was being moved, which works out to roughly 1 MB per IO.
Could you check the IO size for both read and write while doing transfer:
Avg. Disk Bytes/Read
Avg. Disk Bytes/Write
on the Physical Disk section in perfmon. (http://rickardnobel.se/archives/220)
Also, is the 2k8r2 VM a domain controller?