VMware Cloud Community
erietveld
Contributor

High IO latency from simple file copy?

A simple file copy on the local C: disk, from one folder to another, on a Windows 2008 R2 virtual machine causes disk latency (DAVG/wr) to go up to 300 ms. If I give the virtual machine another drive, D:, that is on another LUN, a file copy from D:\ to C:\ even makes DAVG/wr latency go up to 1500ms. The high write latency is measurable on other virtual machines on the same LUN.

The same file copying activity on a Windows 7 virtual machine on the same LUN leaves disk latency (DAVG/wr) below 20ms.

Latency is measured with esxtop on the host and iometer inside guests. During my tests there were no other virtual machines running.

Is the disk latency *supposed* to go so high from a simple file copy? It would make me uncomfortable if somebody copying a large file to another folder on the file server could blow up write latency for all other virtual machines too. Or is it not supposed to, and I have misconfigured something?

Our setup:

Server is HP Proliant DL165 G7

SAN is HP MSA P2000i G3

ESXi 5.0 Driver Rollup 2

The server has four gigabit Ethernet ports. vmnic0 is connected to a switch that carries the management and virtual machine networks. vmnic1 is not used. vmnic2 goes to port A0 on the SAN. vmnic3 goes to port A1 on the SAN. Controller B on the MSA has been shut down. The SAN has two LUNs. Using "Manage paths" I have disabled path vmnic2-A1 for Lun0 and vmnic3-A0 for Lun1, so each LUN has a dedicated cat6 cable. Both the Windows 2008 R2 and the Windows 7 virtual machines were installed to Lun0.

1 Solution

Accepted Solutions
rickardnobel
Champion

Thanks, and it is strange that this behavior has not received any attention. It does look like it could cause some odd results on shared storage if many VMs are competing for the same disk systems.

It might be that in reality this kind of large IO is often not possible, since there must really be 32 MB of contiguous free disk space available.

My VMware blog: www.rickardnobel.se

29 Replies
rickardnobel
Champion

erietveld wrote:

Is the disk latency *supposed* to go so high from a simple file copy?

No, those are really very high numbers. Do you only observe this on the Windows 2008 R2 server and not on any other VMs?

erietveld
Contributor

Yes, it happens in this manner only on the Windows 2008 R2 machine, and not on the Windows 7 machine. I have deleted the VMs and reinstalled from scratch a number of times.

I can also create very high latency in esxtop on Linux virtual machines by writing directly to the disk like so:

dd if=/dev/zero of=/dev/sda bs=1M count=5000

This causes DAVG/wr to go to 300 ms or higher in esxtop. A simple file copy on Linux does not cause high latency.

rickardnobel
Champion

I am not familiar with the specific SAN you have, but could there be issues with write caching? Does it have a battery-backed cache enabled?

A missing or unconfigured battery-backed cache could cause very slow write times.

erietveld
Contributor

The SAN has write caching enabled. "Battery-free cache backup with super capacitors and compact flash"

The console of the SAN shows no warnings or errors. I had the vendor (HP) check for hardware issues. We even replaced the controller. This did not help.

I see no write latency issues when attaching the LUNs to a physical machine (i.e., running something else instead of ESXi).

Copying a file from one LUN to another in the vSphere Client also shows no issue.

rickardnobel
Champion

Do you see anything else strange while doing heavy disk activity - like high CPU on the specific guest or on the host?

What kind of network usage do you see on the vmnics?

Which scsi controller type are you using in the Windows 2008 machine?

erietveld
Contributor

I haven't noticed anything out of the ordinary, but that doesn't mean nothing is. During the file copy, "explorer.exe" has 25% CPU usage on the guest, which is unusually high for a copy operation but does not seem to indicate a bottleneck. The Windows2008R2 server has only one virtual CPU assigned.

On the host, during the file copy the CPU usage spikes up to about 8% from 1.5% average.

During a file copy from C: to D: (Lun0 to Lun1), data receive on vmnic2 goes to about 50 MB/s and data transmit on vmnic3 goes to about 50 MB/s. There seems to be no other noticeable activity.

On the Windows 2008R2 virtual machine, in "Device Management" -> "Storage Controller" the following device is listed:

LSI Adapter, SAS 3000 series, 8-port with 1068

If that is not what you meant with "Which scsi controller type" please tell me how to find that out.

It's a fresh install from DVD, I haven't installed anything or made any changes except configuring the network, and installing iometer.

rickardnobel
Champion

erietveld wrote:

During a file copy from C: to D: (Lun0 to Lun1), data receive on vmnic2 goes to about 50 MB/s and data transmit on vmnic3 goes to about 50 MB/s. There seems to be no other noticeable activity.

So CPU usage is low and should not be the issue. The 50 MB/s you see: is this really MB (as in megabytes) or is it megabits? If it is megabytes, the throughput is still acceptable, but not if it is megabits.

On the Windows 2008R2 virtual machine, in "Device Management" -> "Storage Controller" the following device is listed:

LSI Adapter, SAS 3000 series, 8-port with 1068

If that is not what you meant with "Which scsi controller type" please tell me how to find that out.

You can see it in the vSphere Client on the VM: check the SCSI controller type. However, it is almost certainly "LSI Logic SAS", which is good and should not be the issue either.

J1mbo
Virtuoso

What is the value of Disk.SchedNumReqOutstanding in the host advanced settings?

erietveld
Contributor

Indeed, the SCSI controller is LSI Logic SAS.

For the data transmit/receive, the unit listed is KBps (in that capitalization) and the value is above 50000.

Windows 2008R2's file copy dialog reports 45 MB/second transfer speed (in that capitalization).

Since it is a cat6 gigabit link, and the array can easily handle more IOPS, I would have expected it to be capable of twice that, but I am much more concerned about the high latency than the throughput.
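For reference, here is a small sketch of the unit arithmetic behind these numbers. It assumes that "KBps" means 1000 bytes per second and that the link offers 1000 Mbit/s of raw bandwidth, ignoring iSCSI and TCP overhead. The function names are made up for illustration.

```python
# Convert a network counter in KBps into Mbit/s and link utilisation.
# Assumptions (not confirmed by the thread): 1 KB = 1000 bytes, and a
# gigabit link carries 1000 Mbit/s with protocol overhead ignored.

def kbps_to_mbit_per_s(kilobytes_per_s: float) -> float:
    """Kilobytes per second -> megabits per second (decimal units)."""
    return kilobytes_per_s * 1000 * 8 / 1_000_000


def link_utilisation(kilobytes_per_s: float, link_mbit: float = 1000.0) -> float:
    """Fraction of the link's raw bandwidth that the given rate occupies."""
    return kbps_to_mbit_per_s(kilobytes_per_s) / link_mbit


rate_kbps = 50_000  # "the value is above 50000"
print(f"{kbps_to_mbit_per_s(rate_kbps):.0f} Mbit/s")
print(f"{link_utilisation(rate_kbps):.0%} of a gigabit link")
```

Under these assumptions, 50000 KBps is about 400 Mbit/s, or roughly 40% of one gigabit link, so the counter really is in kilobytes and not kilobits.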

The value of Disk.SchedNumReqOutstanding is shown as 32. (I have not changed any advanced setting after installing ESXi Driver Rollup 2)

rickardnobel
Champion

erietveld wrote:

For the data transmit/receive, the unit listed is KBps (in that capitalization) and the value is above 50000.

Windows 2008R2's file copy dialog reports 45 MB/second transfer speed (in that capitalization).

Since it is a cat6 gigabit link, and the array can easily handle more IOPS, I would have expected it to be capable of twice that, but I am much more concerned about the high latency than the throughput.

That is decent throughput, but as you say the latency values are way too high and will likely affect performance a lot.

Could you take some esxtop screenshots during a file copy? The d, u and v screens.

J1mbo
Virtuoso

As mentioned in another thread, latency is essentially the product of queue depth and per-IO transaction time, spread across the number of drives. Can you provide some info about the array? For whatever reason, this 2k8R2 VM is just saturating its controller queue.

Basic file handling does seem to be a problem with 2k8 and R2, though. Only last week I came across a situation where Win2k8 (not R2 in that case) aggressively cached file data to the point of excluding quite literally everything else (this is demonstrable on both physical and virtual installs).

erietveld
Contributor

The top one is the output of the show disk-statistics command on the array.

The other three are esxtop in d, u and v mode, respectively.

If you need more sampling points, please let me know. This screen was taken towards the end of the file copy (in the last minute), but the latency was consistently above 300ms, and often as high as 1000ms. The 3 esxtops are not exactly in sync, but they are within 1 second of each other.

@J1mbo: unfortunately, I am not experienced enough to know what I could tell you about the array that would be of interest. Can you be more specific about what I should tell you?

It's an HP MSA P2000i G3, with 12 15k rpm 600 GB SAS drives:

8 are Hitachi HUS156060VLS600

4 are Seagate ST3600057SS

rickardnobel
Champion

Thanks for the esxtop screens.

Just some questions: vmhba36 is the software iSCSI adapter, I guess?

Could you also provide an esxtop "n" screenshot during a file copy?

As for the SAN, do you know how the two datastores are physically configured? That is, how many disks and what RAID level?

rickardnobel
Champion

Some comments on the esxtop data so far:

The "d" screen:

Around 90 IOs per second, half reads and half writes. There is no kernel latency for the IOs, only device latency. Read times are almost decent at around 25 ms, but write latency is very high: 438 ms.

The "u" screen:

52 read commands per second, yet still around 52 MB/s read, which is a bit strange. Very, very large IOs?

43 writes/s to the other LUN and 20 active commands, that is, commands "in flight". This also indicates that the writes are slow, since you have 20 commands outstanding and each takes some 400 ms to complete.

The "v" screen:

Only the fitw02 VM is doing any disk activity, so there should be no disturbance from the other VMs. Are there any more ESXi hosts connected to the same SAN?

Have you tried reading something and writing it back to the same Windows partition? That would involve only the first LUN and throw both reads and writes at it. Then try the same on the second LUN. It would be interesting to see if they perform the same.
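
As a rough cross-check of the "u" screen numbers, Little's law says that average latency is about the number of outstanding commands divided by the completion rate. Here is a minimal sketch using the 20 active commands and 43 writes/s quoted above. The function name is made up for illustration, and the result is only an estimate, since the screens were not captured at exactly the same instant.

```python
# Little's law: outstanding = completion_rate * latency,
# so latency = outstanding / completion_rate.
# The inputs are the figures quoted from the esxtop "u" screen.

def estimated_latency_ms(outstanding_cmds: float, completions_per_s: float) -> float:
    """Average time a command spends in flight, in milliseconds."""
    if completions_per_s <= 0:
        raise ValueError("completion rate must be positive")
    return outstanding_cmds / completions_per_s * 1000.0


estimate = estimated_latency_ms(outstanding_cmds=20, completions_per_s=43)
print(f"estimated write latency: {estimate:.0f} ms")  # prints "estimated write latency: 465 ms"
```

That is in the same range as the 438 ms DAVG/wr on the "d" screen, so the queue depth and the latency tell a consistent story.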
erietveld
Contributor

Yes, vmhba36 is the software iSCSI adapter. Attached is the esxtop n screen during a file copy.

Lun0 and Lun1 are both 6-disk RAID 6 arrays. Earlier, I tested with a 12-disk RAID 0 array and still got latency above 300 ms, but I cannot reproduce that currently as I don't have free disks. The SAN vendor (HP) has walked me through a long troubleshooting procedure and has insisted that the problem is not in the SAN array or its current configuration.

Copying on the same partition gives latency levels above 300ms and same transfer speed (45 MB/second)

Both LUNs perform the same.

There are currently no other hosts connected to the SAN.

rickardnobel
Champion

Everything looks quite good on the network view too. I see that both iSCSI vmnics are used and none of them have any real load either. No dropped packets.

RAID5 and RAID6 do have some write penalty, but nothing like what you are seeing. Are you sure that the cache settings are OK? I am sure you have verified this, but could there be something incorrect in the write-through/write-back settings?

By the way, could you run a disk performance tool, like IOmeter, doing only reads or only writes, and see what the result is?

erietveld
Contributor

As far as I can verify, the caching settings are OK. The interface tells me caching is enabled, there are no warnings or errors, and an HP support engineer has assured me that the caching settings are correct. Also, when we connect the SAN to another server, like a Linux host instead of ESXi, we can write with 100 MB/s throughput and low latency. The latency problem does not reproduce when we copy a file from one LUN to another in the vSphere Client, nor when we copy a file inside a Windows 7 virtual machine.

How should I configure IOMeter to do a proper test?

When I configure it with one worker doing 16K writes (0% read, 0% random) and allow 8 outstanding IOs, it writes with 45 MB/s throughput and 3 ms latency on the Windows 2008 R2 server. iometer reports the same figures as I see in esxtop. If I allow 32 outstanding IOs, the reported throughput is 55 MB/s and latency goes up to 9 ms. Again iometer and esxtop agree. Doing only 16K reads with 32 outstanding IOs, it reads with 110 MB/s and 4 ms DAVG/rd.
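
These iometer results can be sanity-checked with a bit of arithmetic: IOPS follows from throughput and block size, and Little's law then gives the latency to expect for a given number of outstanding IOs. This is a sketch, assuming 1 MB = 1024 KB and the 16K block size configured above. The helper names are illustrative only.

```python
# Cross-check the iometer runs: derive IOPS from throughput and block size,
# then estimate latency as outstanding IOs / IOPS (Little's law).
# Assumption: 1 MB = 1024 KB.

def iops(throughput_mb_s: float, block_kb: float) -> float:
    """IO operations per second for a given throughput and block size."""
    return throughput_mb_s * 1024 / block_kb


def expected_latency_ms(outstanding: int, ops_per_s: float) -> float:
    """Average per-IO latency in milliseconds for a steady queue depth."""
    return outstanding / ops_per_s * 1000.0


# (label, throughput in MB/s, outstanding IOs, latency reported in the post)
runs = [
    ("16K write, 8 outstanding", 45, 8, 3),
    ("16K write, 32 outstanding", 55, 32, 9),
    ("16K read, 32 outstanding", 110, 32, 4),
]

for label, mb_s, outstanding, reported_ms in runs:
    rate = iops(mb_s, block_kb=16)
    estimate = expected_latency_ms(outstanding, rate)
    print(f"{label}: {rate:.0f} IOPS, "
          f"estimated {estimate:.1f} ms, reported {reported_ms} ms")
```

The estimates come out at roughly 2.8, 9.1 and 4.5 ms, close to the reported 3, 9 and 4 ms, so with small IOs the array behaves as expected.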

rickardnobel
Champion

I wonder if the Windows 2008 R2 server is using some really, really large IOs, which could cause these extreme latencies. We saw only 50 commands per second, and at the same time around 50 MB per second being moved.

Could you check the IO size for both reads and writes during the transfer:

Avg. Disk Bytes/Read

Avg. Disk Bytes/Write

on the Physical Disk section in perfmon. (http://rickardnobel.se/archives/220)
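
The average IO size can also be estimated from the esxtop numbers alone, by dividing throughput by the command rate. This is a minimal sketch, assuming 1 MB = 1024 KB and using the roughly 52 MB/s and 52 reads/s seen on the "u" screen. The function name is hypothetical.

```python
# Average IO size = throughput / commands per second.
# Assumption: 1 MB = 1024 KB; inputs are the approximate esxtop figures.

def avg_io_size_kb(throughput_mb_s: float, commands_per_s: float) -> float:
    """Average size of one IO in KB."""
    if commands_per_s <= 0:
        raise ValueError("command rate must be positive")
    return throughput_mb_s * 1024 / commands_per_s


size_kb = avg_io_size_kb(throughput_mb_s=52, commands_per_s=52)
print(f"average IO size: {size_kb:.0f} KB")  # prints "average IO size: 1024 KB"
```

If the perfmon counters report values near 1048576 bytes, that would match this estimate of about 1 MB per IO.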

J1mbo
Virtuoso

Also, is the 2k8r2 VM a domain controller?
