Hardware - Cisco UCS with Nexus switching
Storage - Clustered NetApp (3220 with Flash Cache) with multipath iSCSI backend
VMware - ESXi 5.1
NIC Driver - VMXNET3
OS - Windows Server 2008 R2, fully patched
I know this topic has been covered ad nauseam. I've been reading articles for days, so I'm mainly wondering whether I'm missing something. When I copy files between two Windows Server 2008 R2 VMs (both in the same VMware cluster), the copy starts at around 150-300MB/s but eventually drops to around 30-35MB/s. My main question is why the initial rate is so high and then falls to a fraction of that.
So far, I have tried the following:
Am I missing something obvious here? I know we are ultimately limited by backend iSCSI performance, since every folder I'm copying to and from lives on that storage, but I know 150-180MB/s is achievable based on many different performance tests with Iometer, LanSpeed, etc. Jumbo frames are not an option at this time either.
This looks like a Windows buffering issue to me, since the copy rate is so much higher in the first 30-60s of the transfer, but I can't pin down the root cause.
Thanks for any suggestions!
Here is the initial copy rate: the estimated time to finish is under 60s, but the rate eventually falls to ~30-40MB/s, so the copy takes ~3-4 min.
Are you doing any traffic shaping on either the vSwitch hosting the iSCSI vmk or the VM network?
What does resource usage inside the guest OS look like? How about esxtop on the hosts involved in the copy? Look at your disk adapter, disk device, and disk VM stats for latency indications or anything else that jumps out at you.
If you think Windows buffering is causing this behavior, try xcopy /j, which performs the copy using unbuffered I/O.
No, it's a basic vSwitch with no features or traffic shaping. Very simple as far as vSphere is concerned; very complex everywhere else.
EDIT: xcopy /J took longer, so Windows buffering may not be the problem. The 10GB copy took ~7 min, about double the time of a copy/paste from the other server (over SMB).
What types and sizes of files are in this copy? Maybe toward the end it hits lots of small files and slows down because of the per-file overhead?
Good question: the copy is approx. 10GB total with about 10,500 files. The number of files could very well be the underlying issue, as a network test utility on this server can read/write a single 10GB file in about 75s.
So perhaps the copy is running as fast as the hardware and NIC configuration allow. A secondary question, then: is this as fast as a Windows Server 2008 R2 copy can go with this many files? Is it simply the number of files that causes the slowdown? Interesting (sort of).
I just copied 9 files totalling 2.64GB and it took ~19s, an average of almost 140MB/s, which is much more in line with what I think I should always be seeing. So JDPTECHNC, you could be spot on with your assessment about the number of files.
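A quick way to sanity-check the rates quoted in this thread is to compute them directly from the sizes and times reported. The sketch below is plain arithmetic on those figures; `throughput_mb_s` is a hypothetical helper, it treats 1 GB as 1000 MB for simplicity (using 1024 shifts the results by a few percent), and the ~4 min time for the many-file copy is approximate.

```python
# Back-of-the-envelope throughput checks using figures quoted in this thread.
# Simplification: 1 GB is treated as 1000 MB.

def throughput_mb_s(size_gb: float, seconds: float) -> float:
    """Average throughput in MB/s for size_gb transferred in `seconds`."""
    return size_gb * 1000 / seconds

tests = {
    "9 large files, 2.64 GB in ~19 s": (2.64, 19),
    "single 10 GB file in ~75 s": (10, 75),
    "10,500 files, 10 GB in ~4 min": (10, 4 * 60),
}

for label, (gb, secs) in tests.items():
    print(f"{label}: {throughput_mb_s(gb, secs):.0f} MB/s")
```

This prints roughly 139, 133 and 42 MB/s, which shows that the large-file tests agree with each other and that the gap appears only when the same volume is split across thousands of files.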
I always knew you'd get worse throughput with a larger number of files, but I didn't expect such a performance hit. I wonder if there is any way to improve these copies now that the slowdown is ostensibly due to the number of files and not the networking config.
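One common way to recover some of the per-file overhead is to run several file copies concurrently. This lets the fixed latency of each open/close (and each on-access scan) overlap with other transfers. robocopy's /MT switch on Windows Server 2008 R2 does this natively. The Python sketch below illustrates the idea. `copy_tree_parallel` and its default of 8 workers are hypothetical choices, not tuned values, and the approach only helps if per-file latency, rather than backend disk throughput, is the bottleneck.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def copy_tree_parallel(src: str, dst: str, workers: int = 8) -> int:
    """Copy every file under src into dst, preserving relative paths.

    Hypothetical helper: per-file costs (open/close, metadata, A/V scans)
    are mostly latency, so several concurrent copies can hide some of it.
    Returns the number of files copied.
    """
    src_root, dst_root = Path(src), Path(dst)
    jobs = []
    # Walk the tree once on the main thread, creating destination
    # directories up front so worker threads only copy file data.
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = Path(dirpath).relative_to(src_root)
        (dst_root / rel).mkdir(parents=True, exist_ok=True)
        for name in filenames:
            jobs.append((Path(dirpath) / name, dst_root / rel / name))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every copy and re-raises the first error, if any.
        list(pool.map(lambda pair: shutil.copy2(*pair), jobs))
    return len(jobs)
```

Usage would look like `copy_tree_parallel(r"\\fileserver\share\data", r"D:\data", workers=8)` (paths are placeholders). Note that `shutil.copy2` copies file data and timestamps but not NTFS ACLs, so robocopy remains the better tool when permissions must be preserved.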
Thanks again for the assistance JDPTECHNC!
PARTIALLY SOLVED: the copy was previously taking around 4 min, so on a hunch I turned off Symantec A/V and the copy time dropped to 1:50, an average rate of approx. 91MB/s. That's not as high as with large files, where I've seen 140MB/s, but a lot better than before. I assume we have real-time protection enabled in SAV, so every file is being scanned. I'd still like to speed up copies with large numbers of files, but this at least helps.
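The A/V result can be turned into a rough per-file cost. This sketch assumes a simple linear model, total time ≈ bulk transfer time + number of files × fixed per-file overhead. It also takes the 75s single-file test as the bulk transfer time for 10GB. Both are assumptions, and the ~4 min "before" figure is approximate. `per_file_overhead_ms` is a hypothetical helper.

```python
# Rough per-file overhead estimate under an assumed linear model:
#   total_time ~= bulk_time + n_files * overhead_per_file

def per_file_overhead_ms(total_s: float, bulk_s: float, n_files: int) -> float:
    """Fixed cost per file in milliseconds implied by the linear model."""
    return (total_s - bulk_s) / n_files * 1000

BULK_S = 75        # single 10 GB file via the network test utility
N_FILES = 10_500   # files in the 10 GB copy

with_av = per_file_overhead_ms(4 * 60, BULK_S, N_FILES)   # ~4 min with real-time A/V
without_av = per_file_overhead_ms(110, BULK_S, N_FILES)   # 1:50 with A/V disabled

print(f"with A/V:    ~{with_av:.1f} ms per file")
print(f"without A/V: ~{without_av:.1f} ms per file")
```

Under these assumptions the output is about 15.7 ms per file with A/V and 3.3 ms without. The remaining few milliseconds per file still add up to roughly 35s across 10,500 files, which is consistent with many-file copies staying slower than single large files even with A/V off.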