VMware Cloud Community
aughsydney
Contributor

Why does Storage vMotion solve my Windows disk I/O problem?

Hi There

I'm having a problem in part of our VMware environment.

The guest OS is Windows Server 2008 R2 (64-bit).

These are the symptoms we are seeing: an application is having performance issues.

I checked everything in Windows, and the only thing Performance Monitor comes back with is that "Avg. Disk Queue Length" is much greater than 2. Not only that, but "Avg. Disk sec/Transfer" is also outside tolerance levels.
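
As a rough illustration of the check described above, here is a minimal Python sketch that averages a few counter samples and flags them against limits. The queue limit of 2 is the rule of thumb from this post; the latency limit of 0.025 seconds, the function name `flag_disk_counters`, and all sample values are assumptions made up for illustration, not measurements from this environment.

```python
from statistics import mean

QUEUE_LIMIT = 2.0                 # rule of thumb quoted in the post
SEC_PER_TRANSFER_LIMIT = 0.025    # ASSUMED tolerance in seconds, not from the post


def flag_disk_counters(queue_samples, latency_samples,
                       queue_limit=QUEUE_LIMIT,
                       latency_limit=SEC_PER_TRANSFER_LIMIT):
    """Average the samples and report whether each average exceeds its limit."""
    avg_queue = mean(queue_samples)
    avg_latency = mean(latency_samples)
    return {
        "avg_queue": avg_queue,
        "avg_sec_per_transfer": avg_latency,
        "queue_over_limit": avg_queue > queue_limit,
        "latency_over_limit": avg_latency > latency_limit,
    }


# Hypothetical samples, invented for illustration only.
before = flag_disk_counters([4.2, 5.1, 3.8], [0.12, 0.30, 0.15])
after = flag_disk_counters([0.01, 0.005, 0.008], [0.004, 0.003, 0.005])

print("before:", before["queue_over_limit"], before["latency_over_limit"])
print("after: ", after["queue_over_limit"], after["latency_over_limit"])
```

In practice the samples would come from a Performance Monitor log export rather than hard-coded lists.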

Microsoft also confirmed that the disk is the cause of the performance problem.

After trying everything with all parties involved (application guys, SAN guys, ESX guys, network guys), I decided to do a Storage vMotion.

After the Storage vMotion, the average disk queue length dropped like a stone to below 0.01 and the performance of the application returned to normal.

I'm now seeing a similar problem on the same system (different VM), and I am pretty sure that a Storage vMotion will solve it.

However, as part of my root cause analysis, I need to know: why does the Storage vMotion solve the problem?

Has anyone come across a similar solution and worked out why it solved your problem?

Please remember I've had level 4 support on the SAN, ESX, application and network. They have all drawn a blank. That's why I've not provided all the information: it has already been investigated. If you want more information, I'll provide it.

Regards,

P.S. The disks are aligned properly (first thing I checked).

AureusStone
Expert

Seems like a SAN issue.

What version of ESX are you using?

What SAN do you have?

What disks?

How is your pathing configured?

What is the load across the links?

What size are your LUNs?

Are they replicated?

How old is your environment and how long has this issue been occurring? Has it been gradual, or did it just go bad one day?

What is the load on the LUNs? I know you have had your SAN people look into this issue, but the most likely cause of this type of issue is that the physical disks backing the volume are under a high load, or, if you have 7,200 RPM disks, just under some load... :)

athlon_crazy
Virtuoso

Since the problem disappeared once you moved the VM to the destination datastore, I think this has something to do with your source datastore. Identify which LUN has the issue, check the storage latency for that target (it should be <10ms), and compare it with the good one (the destination) using vCenter or the esxtop command. Storage vMotion didn't solve the issue by itself. It is just that the VMs have been moved to another LUN with better latency.
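
The comparison suggested above could be sketched roughly like this in Python. The 10 ms target is the figure from this reply; the function name `compare_datastores` and the sample latencies are hypothetical, standing in for values you would read from vCenter or esxtop for each LUN.

```python
from statistics import mean

TARGET_MS = 10.0  # latency target mentioned in this reply


def compare_datastores(source_ms, dest_ms, target_ms=TARGET_MS):
    """Compare average latency (in ms) of the source and destination datastores."""
    src = mean(source_ms)
    dst = mean(dest_ms)
    return {
        "source_avg_ms": src,
        "dest_avg_ms": dst,
        "source_within_target": src < target_ms,
        "dest_within_target": dst < target_ms,
        "slower": "source" if src > dst else "destination",
    }


# Hypothetical latency samples in milliseconds, invented for illustration.
result = compare_datastores([85.0, 120.0, 95.0], [3.0, 4.5, 2.5])

print(result["slower"], result["source_within_target"], result["dest_within_target"])
```

If the source LUN is consistently above the target while the destination is well below it, that points at the source datastore or the path to it rather than at the VM.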

http://www.no-x.org
aughsydney
Contributor

Hi y'all

I've just been cleaning up the mess and will update this posting as well.

Remember I said I had the storage guys (in-house Level 3) and NetApp (Level 4) check out the NetApp filer for any problems?

In all, the filer was checked five times... you know what's coming, don't you?

The original problem was reported six weeks ago. A conference call was called immediately to find out why Windows Performance Monitor was reporting poor disk I/O stats (Avg. Disk Queue Length > 4, Avg. Disk sec/Transfer > 0.1, sometimes > 1.0).

Over and over and over and over again, storage and IP SAN said, "It is not our problem, everything is fine." (Five different investigations.)

Then, out of the blue, 3 days ago there was another NetApp guy on site (this is a huge environment) who, I don't know, got wind of the situation. To cut a long story short, he found that one of the filer interfaces (one of four... must be load balanced) was behaving erratically.

Not only that, the same NetApp guy found a Remedy ticket (!) advising that the NIC was disabled five months ago because it was causing issues, and six weeks ago some muppet in storage re-enabled it! The exact same day our problems started.

As soon as this interface was disabled (I have Performance Monitor graphs to prove it), the disk I/O improved dramatically.

Thanks to all of you who answered. If ever ever ever ever ever ever ever ever you see stats like the ones I mentioned above, it's the disk. Do not start investigating anything else. No matter how bloody confident the storage guys sound, it's their problem.

AureusStone
Expert

I am glad you have found your solution.

Occam's razor. If it looks like a storage issue, it usually is. :)

I hope they have pulled the cable and left a note on it. :)
