Hey vGeeks,
I have a long-running case with VMware Support (#12148282302) about the behavior we observe when a snapshot is created in vSphere on a fully patched ESXi 5 host. I'll list our environment specs below, but the gist is that when I create a snapshot of a VM (doesn't matter if it's busy or totally idle), the VM becomes unresponsive to varying degrees for roughly a minute, and the snap then takes approximately one minute per gig of RAM (in the VM) to complete. To accentuate the issue, I created a new VM with 16GB of RAM and nothing running in it (other than W2k8R2), and it takes 14-18 min to create a snap. This happens in both of our clusters/sites.
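To put the numbers above in perspective, here is a rough back-of-the-envelope sketch of the average write rate those snapshot times imply. It assumes the snapshot writes the VM's full configured RAM to disk and that 1 GB = 1024 MB; both are my assumptions for illustration, and the function name is made up, not anything from VMware.

```python
def implied_write_rate_mb_s(ram_gb: float, minutes: float) -> float:
    """Average MB/s needed to write ram_gb of memory in the given minutes.

    Assumes the full configured RAM is written (an assumption, not a
    confirmed detail of how the memory snapshot works).
    """
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return ram_gb * 1024 / (minutes * 60)


# 16 GB test VM with the 14-18 minute snapshot times reported above
for minutes in (14, 18):
    rate = implied_write_rate_mb_s(16, minutes)
    print(f"{minutes} min -> {rate:.1f} MB/s")

# The "one minute per gig" rule of thumb
print(f"1 GB/min -> {implied_write_rate_mb_s(1, 1):.1f} MB/s")
```

Under those assumptions, that works out to somewhere around 15-20 MB/s on average.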
Where I need YOUR help:
If you have a bit of time (and I know this is a big ask), either test with an existing VM or with a new one and post back your stats and hardware config when creating a snapshot (with memory capture). Quiescing is irrelevant; this happens when the memory is captured on the snap regardless of anything else.
Our environment:
I appreciate any help/validation. We don't believe this happened on ESXi 4.x, but we're open to any explanation.
Thanks,
Chris
I'd troubleshoot this by looking at the disk performance as seen by the hosts and in the guests. Using IOMeter on an affected VM, what sequential write throughput are you seeing with, say, 32k IOs and a queue depth of 8? And at what rate do storage vMotion events run?
Hey J1mbo,
Already did all that (IOmeter, etc.) with Support about 3 months ago (we've been working on this since February). It isn't an I/O issue (per them), and it makes no difference whether the VM is on an isolated host or even writing to local storage (though local storage does seem to be worse, so there could be some correlation).
Support is currently trying to explain the unresponsiveness part of the problem (which lasts for roughly a minute regardless of RAM, and up to the 30ish% mark on creation progress) with the following KB article: http://kb.vmware.com/kb/1013163. I'll grant that some minor unresponsiveness or stunning is relevant to this, but not to the degree we're seeing.
All that said, and not to ignore your questions, we couldn't get much useful info out of IOmeter because the VM is repeatedly stunned. The unresponsive periods are literally moments in which the VM is frozen, so no stats are recorded. What metric (GB/min?) would you like me to give you for the storage vMotion rate? IMO, we have really good performance with that.
Thanks,
Chris
Seeing the same thing on ESXi 4.1.
Machines with 16GB of RAM take approx. 20 minutes to snapshot when including the memory, regardless of workload.
Confirmed on 3 different VMs.
R910s with iSCSI EqualLogic PS600E.
Have you gotten anywhere with this?
vonsch,
That's good to know about 4.1. Perhaps we never snapped large VMs prior to 5.0 and thus assumed the problem began with 5.0. Even so, it shouldn't take a minute per gig to write the memory to disk (IMO).
The latest from support is that this is by design and they are working (at a very low priority, it seems) to repro it and collaborate with engineering on it. It's not hard to repro (obviously) but I still haven't heard a technical reason for why the mem snap lasts that long. If I hear anything, I'll post it here.
--Chris