VMware Cloud Community
bbricker
Contributor

Where to start with troubleshooting IO issues?

We are seeing some major IO issues with a new VM running Win2k3 x64 Enterprise and SQL Server 2005 x64. This was previously a physical server that was nearly 5 years old (an old 6th-generation Dell). We set up the VM in what I can only imagine is the most wasteful use of resources, but with the hope that it would be the best performing:

We bought 2 new Dell R710's for the explicit purpose of running the single SQL VM guest (and a couple of other terminal server VM guests) in a vSphere 4 ESX HA cluster. The R710's have 2 sockets of new 3.0GHz quad-core Nehalems with 48GB of RAM. They have dual FC HBAs connected to redundant FC switches, connected to an EMC CX3-20. This CX is an "upgrade" from an older CX300, so we still have a couple of older 2Gbit DPE/DAEs in the stack, which limits the entire back-end to 2Gbit.

For my SQL Server guest VM, which is running two 50GB databases, I set up a RAID10 (6 x 146GB 15K FC) storage group and a RAID1 (2 x 146GB 10K FC) storage group. Each array is assigned/contains only 1 LUN, 1 VMFS, and 1 VMDK for this single SQL guest. The primary SQL DBs live on the RAID10 and the log/tempDB lives on the RAID1 volume. The VMFS volumes were created using vCenter, so my understanding is that they should be aligned. The drives/volumes within the Windows guest were also aligned using diskpart, following VMware's best practices guide on partition/guest alignment (sketched below).

Other than this system now running in VMware, and running x64 versions of Win2k3 and SQL 2005, this is pretty much an identical setup to the previous 5-year-old physical server as far as the storage goes. We did go way overkill on the guest processor and memory: the VM has 4 vSMP CPUs and 16GB of memory, where the old physical box had 2 old Xeon procs with hyper-threading and 8GB of RAM. We are just trying to improve the performance of this system (a medical EHR) from the end-user's perspective and not leave anything to chance.
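For reference, the alignment procedure we followed looked roughly like this (the disk number, drive letter, and 64KB unit size are just examples from our setup; check the guide and your own layout before running anything):

  C:\> diskpart
  DISKPART> select disk 1
  DISKPART> create partition primary align=64
  DISKPART> assign letter=F
  DISKPART> exit
  C:\> format F: /FS:NTFS /A:64K /Q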

So we've been testing for weeks, and the application runs, for the most part, ridiculously faster from a user standpoint: screens that took 5-10 seconds to load in our EHR on the old physical box load in under a second on the new VM. But anything that causes large SQL table lookups, and especially any heavy file operations on the server like a SQL backup/restore, results in absolutely dismal performance and brings the entire VM to its knees. We did a SQL restore today (reading the SQL .bak file from the same drive it was restoring to) and it took about 3-4x longer to complete than it did on the old physical box. We have PerfMon running on the console, and while monitoring % Idle Time on the disk assigned to the SQL DB, it drops to 0 instantly; the disk is just getting eaten up completely. While the restore process ran for well over an hour (a task that would usually have taken only 20-30 minutes), I pulled up EMC Navisphere and looked at the statistics for the LUN: it never achieved more than 15MB/s read or write at any given time, 30MB/s combined. I started a near-identical SQL restore on the physical box to try to compare apples to apples, and it hit as high as 70MB/s reads and writes, for a combined 140MB/s.
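In case it matters, here is roughly how I've been capturing those disk counters from the guest's command line rather than the PerfMon GUI (the PhysicalDisk instance name "1 F:" is just an example; yours will differ):

  typeperf "\PhysicalDisk(1 F:)\% Idle Time" "\PhysicalDisk(1 F:)\Avg. Disk sec/Read" "\PhysicalDisk(1 F:)\Avg. Disk sec/Write" -si 5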

Something is up and I'm not sure where to start looking. Does ESX "throttle" disk IO inside a VM, or to a VMDK, to keep it from overwhelming an entire VMFS volume? Is there any tuning I can do so that this SQL guest VM can use essentially all the available resources, since I am basically doing a 1-to-1 setup of LUN/VMFS/VMDK?

thanks for reading and for your help!

Ben

0 Kudos
14 Replies
mcowger
Immortal

Use the performance tab and check out your DAVG/cmd time for the LUN the data is on.

What kind of times are you getting?
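If you have access to the service console, esxtop will show the same numbers live; from memory (so double-check the keystrokes on your build), it's something like:

  # esxtop
  (press 'u' for the disk device view; DAVG/cmd is the per-device latency column, in ms)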






--Matt
VCP, vExpert, Unix Geek
VCDX #52 | blog.cowger.us
0 Kudos
bbricker
Contributor

Excuse my ignorance; I'm not sure which metrics you're asking for, so I just turned them all on and then started a SQL backup job (it ran for a couple of minutes before I aborted it).

Saw these peaks:

Commands issued: reached 25741

Read requests: reached 12859

Read rate: reached 41151 KBps

Write requests: reached 12882

Write rate: reached 41176 KBps

These read/write rates are looking better than what I was seeing in Navisphere earlier. Since I first posted, I moved some of these LUNs from SP B to SP A to see if it would make a difference. Maybe I had an SP a little too saturated or something; I doubt it, though, as there are only 5 hosts attached to the entire SAN, it's the weekend, and there is virtually no IO going on right now. Regardless, it's not comparable with the 75MB/s read and write I was seeing previously on the physical box, and I don't understand % Idle Time in PerfMon dropping to 0 either.

thanks,

Ben

0 Kudos
mcowger
Immortal

These aren't the values I need; I need the DAVG/cmd values. They are collected at the LUN level, and I believe they are shown as "Disk Latency".

--M






--Matt
VCP, vExpert, Unix Geek
VCDX #52 | blog.cowger.us
0 Kudos
bbricker
Contributor

The only options under disk are:

Read rate

Read requests

Write rate

Write requests

Commands issued

Stop disk command

Bus resets

Usage

0 Kudos
mcowger
Immortal

Take a look at this screen shot. This is the performance tab after clicking on a host.






--Matt
VCP, vExpert, Unix Geek
VCDX #52 | blog.cowger.us
0 Kudos
bbricker
Contributor

Okay, got it. Sorry, I was looking at the performance options of the VM guest and not the host itself.

The "physical device command latency" as well as physical device read and write latency are all 0. I would guess that is a good thing. What does that tell me? That the issue is not on the host level?

I followed the best practices guide for aligning and formatting the Windows volumes the SQL data resides on, so hopefully that is not an issue in the guest VM, as I'm not sure what else I could tweak inside it.
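(If it's useful to anyone, I double-checked the alignment inside the guest with the command below; for 64KB alignment the StartingOffset values should divide evenly by 65536:

  C:\> wmic partition get Name, StartingOffset

The partitions to check are, of course, whichever ones hold the SQL volumes.)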

So does ESX give a single VM free rein to use as much IO as it can, or does it throttle it so that the VMFS volume it lives on is not overwhelmed? I would think it would give it as many resources as it needed, provided another VM was not requesting them.

Ben

0 Kudos
mcowger
Immortal

If it's all zero, you measured the value for the wrong LUN, or you had no load active on the LUN (it's impossible to have a physical device latency of 0 unless you aren't doing anything). Try again, making sure you display the values for all LUNs on the system by measuring on the host that is running your SQL VM.

ESX does not throttle the IO of VMs.






--Matt
VCP, vExpert, Unix Geek
VCDX #52 | blog.cowger.us
0 Kudos
bbricker
Contributor

Good grief, I'm making myself look helpless here; I can't seem to click anything right, haha. Yes, you are right, I had selected the wrong LUN. They still have all those hellishly long identifiers, and I was going by the last 6 hex digits, which were the same as another LUN's (which, coincidentally, was not even attached anymore; I had to reboot the hosts to get them to finally disappear after refresh and rescan didn't work).
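(In case it saves anyone else the squinting: I believe the service console can map those long device identifiers to the friendly VMFS volume names. On ESX 4 I think the command is:

  # esxcfg-scsidevs -m

though double-check the flag on your build.)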

Anyway, I have the right one selected now. The system is pretty idle now that it's past 5pm, but I'm seeing (values in ms):

physical device command latency: average 4.717, max 9, min 1

physical device write latency: average 1.778, max 12, min 0

physical device read latency: average 5.122, max 20, min 3

Then if I fire up some heavy IO like starting a SQL backup we see:

physical device command latency: max 25

physical device write latency: max 12

physical device read latency: max 39

Thanks,

Ben

0 Kudos
mcowger
Immortal

Those seem slightly elevated, but nothing to be overly concerned about. Given that information, I'd say your array is not to blame, so you can start looking into ESX or the guest OS. It's very unlikely that the queues on ESX are causing problems, so I'd look into what's going on in your guest OS. At that point I can't help much, because I can't admin Windows to save my life.






--Matt
VCP, vExpert, Unix Geek
VCDX #52 | blog.cowger.us
0 Kudos
newcal
Contributor

Hello bbricker,

I am experiencing the same problem as you. Did you ever find the cause? And what about the ESX throttling? I was thinking of that possibility too, but I can't find any confirmation of it. Tomorrow I will perform partition alignment to see if it gives some improvement...

best regards

0 Kudos
mcowger
Immortal

ESX does not throttle the IO performance of guests in any way. With the right backend, a single ESX server can push well over 300,000 IOPS.






--Matt
VCP, vExpert, Unix Geek
VCDX #52 | blog.cowger.us
0 Kudos
newcal
Contributor

I agree with you: vSphere is wonderful!

My MD3000i SAN is wonderful too, because a physical Dell R710 server transfers data to its LUN (RAID5, 4 x 15,000RPM SAS disks) at 100MB/s: 15 seconds to transfer a 3GB file.

But the ESX host (also a Dell R710), with just 1 VM, reaches only 30MB/s writing a 3GB file to it.

I've measured this with the ESX client's performance charts, with IOmeter, and with my wonderful Swiss watch (2.5 minutes to transfer a 3GB file to the SAN)...

I don't think vSphere is so wonderful, but I want to make it work anyway!

0 Kudos
bbricker
Contributor

After reading back through my original post, I realized that I made a pretty severe mistake in my testing. The SQL restore that I performed on the physical server as a comparison was not doing its read/write against the same LUN; it was actually on a different LUN on different spindles. So I was not comparing like situations. It turns out that reading from and writing to the same LUN within ESX is only a bit slower than on my physical servers, not dramatically so.

If I do the same restore in my SQL VM, even reading from another VM connected to the same vSwitch across the network, it is much faster than a read/write within the same VMDK/LUN, provided that the other VM is also on a different VMDK/LUN. So I do think it was mainly the way I had the VM set up and the way I ran my test comparison.

I now have the SQL server set up so that the log/tempDB, indexes, and user DBs are all on different VMDKs, which are on different LUNs, which are on different physical spindles. I never see the disk % Idle Time go to zero in PerfMon on that server during the day with normal usage. It will go to zero the second we start a large copy operation like a SQL restore/backup within the same LUN, though; that compares to maybe 10% idle time on a physical server with less processing power.
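One thing that helped me confirm which database files were actually suffering was SQL 2005's file-stats DMV; from a command prompt on the guest it's roughly this (assuming Windows authentication against the local default instance):

  sqlcmd -S . -E -Q "SELECT DB_NAME(database_id) AS db, file_id, num_of_reads, io_stall_read_ms, num_of_writes, io_stall_write_ms FROM sys.dm_io_virtual_file_stats(NULL, NULL)"

Files with high stall-to-operation ratios are the ones sitting on the saturated spindles.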

So if I came away with anything from this whole experience, it's that there is always going to be some overhead with the VMware layer between your hardware and your VMs, but following SQL best practices and physically separating the IO will help.
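For anyone wanting to do the same split, moving tempdb onto its own VMDK is a one-time change; roughly this (the T: drive and the default logical names tempdev/templog are examples; check yours with sp_helpfile first):

  sqlcmd -S . -E -Q "ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, FILENAME = 'T:\tempdb.mdf'); ALTER DATABASE tempdb MODIFY FILE (NAME = templog, FILENAME = 'T:\templog.ldf')"

The new location takes effect after restarting the SQL Server service.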

I would first double-check that you are indeed performing the same comparison between your physical and virtual tests. Can you describe yours in more detail?

0 Kudos
mjpagan
Enthusiast

Last time I ran into some IO issues with a customer, I used this document to lead me to the source of the issue:

Hopefully it will help you or the next person having issues.

-Mike

Mike Pagán MCITP:EA, MCSE, VCAP5-DCA, VCAP5-DCD,VCP 5, VCP5-DT, CCNA, A+
0 Kudos