VMware Cloud Community
pldoolittle
Contributor

High CPU utilization during svMotion

I am running ESX 3.5 U3 with vCenter 4.1. We have a new SAN (CLARiiON, all FC) and are migrating from the CX3-20 to the CX4-120. Storage is dual-attached FC through a pair of Brocade switches into both SPs (4 paths per LUN).

When migrating virtual machines from one datastore to another using svMotion, some (most) guests go to near 100% CPU utilization for the duration of the move. Interestingly, some guests (notably MS file/print-only servers) don't exceed 50%. Web servers and DBs peg the gauge, even when user traffic is near zero (~200 MHz). I have also noticed that there appears to be some residual "noise" on the CPU for some time after a migration (see attached image).

These systems are automated, so there are busy and slow times, but never a time when a non-responsive server will go unnoticed. I'd like to be able to use svMotion knowing that I don't have to schedule downtime.

Is this normal for svMotion? Is there a fix or workaround?

5 Replies
kjb007
Immortal

I could see this happening if you had a low number of spindles backing your datastore. With an svMotion you are reading from the same spindles the running VM is still using, so you're doubling the I/O load on them for the duration of the move. If you are further sharing those same spindles with other virtual machines, the I/O requirement compounds. On an I/O-starved system, the CPU will tend to spike, since it sits waiting on responses from storage while still trying to keep up with new requests.
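
To put rough numbers on that (assuming the common rule of thumb of roughly 180 random IOPS per 15K FC spindle; actual throughput depends on the array and workload):

    7 spindles x ~180 IOPS = ~1,260 IOPS for the RAID group

A full-speed svMotion read stream plus the guests' normal I/O can consume most of that budget on its own, so it doesn't take much extra foreground load before requests start queueing and guests stall waiting on disk.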

Using esxtop while you perform an svMotion will show you whether physical disk latency from your disk subsystem has increased, which would lead to the higher CPU utilization.
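
If you'd rather capture the whole migration window than watch it live, esxtop also has a batch mode (standard flags; the interval, sample count, and filename here are just examples):

    esxtop -b -d 5 -n 120 > svmotion-stats.csv

That's 5-second samples for about 10 minutes: -b is batch mode, -d the sample delay in seconds, -n the number of samples. The output is perfmon-compatible CSV, so you can graph it on a Windows box afterwards.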

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
pldoolittle
Contributor

Thanks for replying. I understand, but these same LUNs are barely working during normal use (I just checked one LUN at 1.46% utilization), and these moves are taking place during off-peak hours.

For reference, we have 5 or 7 spindles (15K FC) per RAID 5 group, 3 LUNs per RAID group, and ~5 VMs per LUN. All VMs are Windows 2K3 guests running Windows boot partitions and/or IIS services. Data lives in SQL databases on separate LUNs/spindles. In a nutshell, these disks aren't doing much besides updating Windows event logs.

Also, I don't see CPU spikes on other machines sharing those same disks/LUNs during the migration. Shouldn't I expect to if the RAID group were struggling to service the load? Nor do I see spikes on guest CPU during data loading, VM creation, or other disk-intensive tasks that should stress an already marginal disk group.

I'll keep digging (and use esxtop) to see if I can find the cause, but at first glance it seems like there may be latency reading/writing the VMDK during svMotion, but not necessarily latency reading/writing the LUN or datastore.

kjb007
Immortal

Understood. esxtop will show you the latency introduced at the guest, vmkernel, and device levels. That may help figure out which part of the stack is causing issues.

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
pldoolittle
Contributor

Thanks! Any suggestions for esxtop config/options to best troubleshoot this issue?

kjb007
Immortal

Sure, start esxtop and press 'v' for VM disk stats.

Use 'f' to add in the IOSTATS and LATSTATS fields. Depending on your view size, you may have to turn on one, check it, then go back and turn on the other in turn.

The LATSTATS will show you DAVG, GAVG, and KAVG: the average latency introduced at the device, guest, and vmkernel levels. The IOSTATS will show you the actual I/O occurring at the time, and the default view itself will show you whether you're queueing your I/O.
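
As a rough guide (commonly cited rules of thumb, not hard limits): KAVG should stay near zero, and sustained values above ~2 ms suggest the vmkernel is queueing; sustained DAVG above ~20-25 ms usually points at the array or fabric; GAVG is approximately DAVG + KAVG, i.e. the latency the guest actually sees. In the queueing columns, nonzero QUED means commands are waiting for a slot in the device queue.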

The other method would be to use 'd' and look at the adapter rollups to see what your HBAs are doing. You can use 'f' again to see the LATSTATS from the HBA perspective.

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB