VMware Cloud Community
Ahraxx
Contributor

Progressively degrading data transfer speed - losing my mind!

We are attempting to migrate a bloated 1.5TB Win2k3 VM from
one SAN to another so that we can do some capital maintenance and
reconfiguration work on the SAN that currently holds this VM. It being a
long weekend, I thought it would be the perfect time to undertake this project,
but I am having major problems with transfer speeds starting out fast and then slowly
nose-diving into the ridiculous. I have been reading posts and articles all
night and I just can’t figure out what is going on.

The setup:

The host server is a Dell PowerEdge 2900 attached via dual iSCSI
links to a Dell MD3000i, a high-end SAN packed with 15k RPM SAS drives in a
RAID configuration.

VMware version is ESXi 4.0, Enterprise bundle.

The virtual machine in question is Win2k3 (not that it matters).

The first destination is a directly attached 1Gbps Linux
NAS with a proper RAID controller and 6G SATA drives.

The other destination that I tried is the host server’s internal
15k SAS disks on its onboard array.

The problem is that no matter what method of moving this VM
I try, the data transfer rate starts out AWESOME, at 95MB/s or higher, but
over the course of an hour or so slows down to a crawl, 10MB/s or even
slower. Looking at the data transfer rates on the disk read/write or on the
NICs, the drop follows a smooth downward arc. The symptom is the same whether
we attempt to perform a backup job from within the VM’s OS, “Clone” from
within the vSphere console, or run a backup via the VMware Data
Recovery appliance interface. How sharply
the transfer rate falls off differs depending on the method,
but it happens nonetheless.
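
To put numbers on that arc rather than eyeballing a graph, cumulative byte counters can be sampled at a fixed interval and turned into per-interval rates. Below is a minimal Python sketch, assuming (timestamp in seconds, total bytes transferred) samples are already being collected somewhere, for example from the NAS's interface counters. The function name and the sample values are made up for illustration.

```python
def rates_mb_per_s(samples):
    """Convert (seconds, cumulative_bytes) samples into per-interval MB/s.

    Uses decimal megabytes (1 MB = 1,000,000 bytes).
    """
    rates = []
    for (t0, b0), (t1, b1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order samples
        rates.append((b1 - b0) / elapsed / 1_000_000)
    return rates


# Hypothetical one-minute samples, not measurements from this system.
samples = [
    (0, 0),
    (60, 5_700_000_000),
    (120, 9_300_000_000),
    (180, 9_900_000_000),
]
rounded = [round(r, 1) for r in rates_mb_per_s(samples)]
print(rounded)
```

Logging these rates with timestamps for each copy method would make it easier to compare how quickly each one degrades.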

When running a backup from within the VM’s OS, the speed
drops slowly, starting out at 80MB/s, down to 60MB/s within an hour, down to
30MB/s an hour after that, and it eventually levels out at about 8MB/s for the rest
of the duration.
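
For a sense of scale, here is a rough back-of-the-envelope sketch in Python of how long the whole VM would take at each of those rates. It assumes decimal units (1.5TB = 1,500,000MB) and treats each rate as if it were sustained for the entire copy, which it is not in practice; the names are just for illustration.

```python
VM_SIZE_MB = 1.5 * 1_000_000  # 1.5TB, assuming decimal units


def hours_to_copy(rate_mb_per_s, size_mb=VM_SIZE_MB):
    """Hours needed to move size_mb at a constant rate_mb_per_s."""
    return size_mb / rate_mb_per_s / 3600


for rate in (80, 60, 30, 8):
    print(f"{rate:>3} MB/s -> {hours_to_copy(rate):5.1f} hours")
```

At a steady 80MB/s the copy would take a bit over 5 hours; at 8MB/s it works out to roughly 52 hours, which is why the levelled-off rate dominates the total job time.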

When we try to use the Clone function from within vSphere,
the job times out within 2 hours or so, and the drop is quicker. It starts
out at 95MB/s and within an hour is down to 10MB/s or lower. I read a post and
found a place where I think I can extend the timeout period, but it seems
idiotic to me that I cannot just turn off the timeout altogether to let the job
finish.

With VMware Data Recovery, the transfer speed drops
MUCH faster and much lower; it starts at about 50MB/s, but nose-dives to under
1MB/s within an hour. After leaving it running overnight, I finally canceled it, as
it had barely moved 50GB.

I have looked at everything that I can think of to explain
this, but everything looks normal. I cannot find the bottleneck or explain why
the speed drops off like this. I have shut off all other VMs on the system. I
have tried moving the target VM while it is powered off. It seems to make no difference!
All systems (the host, the SAN and the NAS) have plenty of resources and none
of them are getting hammered. I’m just at a loss to explain what’s going on.

I read an article about VAAI causing something very similar to
this, but after some research, it seems that VAAI support didn’t make it into
ESXi until version 4.1 and I am on 4.0. I looked for the VAAI controls
within vSphere and did not see them. However, this MUST be some basic and
fundamental communications/buffer issue, and I suspect (given that I have tried
moving the VM to two completely separate destinations) that the problem is
between the $8,000 host server and the $30,000 SAN. Ironically, my home-built
$1,000 host server and $2,000 SAN work perfectly fine!

Suggestions welcome!

7 Replies
Ahraxx
Contributor

Seriously?? 24 hours, 35 views, not one reply??

Ahraxx
Contributor

And we're up to 62 views and 0 replies... No tips on how to troubleshoot the reason for the bottleneck? There have to be ways to see the various caches and buffers. Anyone?

Ahraxx
Contributor

86 and 0

mcowger
Immortal

You are probably driving people away with these antics, to be honest.

But you can watch much of the internals with esxtop... have you tried that? As for buffers and caches, ESX doesn't perform any caching or buffering of I/O.

Lastly, the MD3000i is a very low-end array, not high-end... not sure what you are expecting from it.

--Matt VCDX #52 blog.cowger.us

ElevenB2003
Enthusiast

I'm more of a Fibre Channel guy, but can you verify there aren't any QoS configurations applied anywhere in your iSCSI infrastructure? It almost seems as though it's being throttled down or "policed" after a certain period of time. What other performance monitoring solution(s) do you have? As mcowger said: take a look at esxtop.

What else is running on this SAN? Is it possible something else is chewing up I/O or bandwidth randomly?
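
One way to act on the esxtop suggestion: esxtop has a batch mode (`esxtop -b`) that writes its counters as CSV, which can then be examined offline. The Python sketch below pulls out every column whose header contains a given substring so the values can be tracked across samples. The header text in the sample is made up for illustration; real counter names depend on the ESXi version and the devices present, so check the header row of your own capture before relying on any particular name.

```python
import csv
import io


def columns_matching(csv_text, needle):
    """Return {column_name: [float values]} for headers containing needle."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return {}
    header, data = rows[0], rows[1:]
    result = {}
    for index, name in enumerate(header):
        if needle.lower() not in name.lower():
            continue
        values = []
        for row in data:
            try:
                values.append(float(row[index]))
            except (IndexError, ValueError):
                continue  # skip short rows and non-numeric cells
        result[name] = values
    return result


# Hypothetical capture: column names and numbers are illustrative only.
sample = (
    '"Time","Disk A MilliSec/Command","Disk A MBytes Read/sec"\n'
    '"t1","4.0","95.0"\n'
    '"t2","20.0","30.0"\n'
    '"t3","75.0","8.0"\n'
)
latency = columns_matching(sample, "MilliSec/Command")
print(latency)
```

If latency per command climbs while throughput falls, that would point toward the storage path rather than the destination, though a capture from the real system is needed to say anything definite.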

Ahraxx
Contributor

mc,

I am not sure what "antics" you are referring to, but I do see that it took 9 days from the time of the original post to the time of a half-assed, useless reply. The array is a top-line array for its class. Yes, datacenters have arrays that cost seven figures, but for a single 3-4U rack array, these are up there. With its current configuration it should be able to push quite a bit more than the 60-95MB/s that I would be content with. In any case, diving down to 8MB/s without any other load is highly abnormal. If that does not compute, there are some whitepapers that can help.

Eleven,

Thanks for your post. As far as I can tell (and maybe I just don't know where to look) there is no QoS going on. Even then, I would expect that QoS would be smart enough to utilize unused bandwidth/capacity when nothing else is using it. I will put esxtop on the system and see if it can point us in the right direction. As far as other things on the SAN, there are a few low-utilization servers on there (a print server, for example), but we have monitored them to ensure that they are not chewing up I/O. Also, when opportunity allowed, we shut down literally everything else on the SAN and tried to move files, with exactly the same results!

ElevenB2003
Enthusiast

Ahraxx,

The "antics" that mcowger is referring to are your combative demeanor in your subsequent replies to your own post, as well as your reply to mcowger's post offering some advice. Again, I think you should take a look at downloading some trial virtualization/storage monitoring tools. SolarWinds has some great products that are fully functional for 30 days. Also, have you been through this guide: http://www.vmware.com/pdf/vsphere4/r40/vsp_40_iscsi_san_cfg.pdf ?

Also, it might be worth taking a look at upgrading to the latest version of ESXi (if it's viable for you to do so). There have been many, many improvements in subsequent versions of the hypervisor.
