Hello,
We have a customer who required FT for their domain controllers on our platform. While troubleshooting another issue, I noticed that all 4 of these VMs have %VMWAIT values that hover around 90 and will shoot up over 100. This is ESXi 5.5, so they each have a single vCPU (the FT limit in that release). The %USED value is generally less than 15.
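For reference, I have been collecting these numbers with esxtop in batch mode from the ESXi shell (the output file name below is just an example):

    # capture 12 samples at 5-second intervals in batch mode
    esxtop -b -d 5 -n 12 > ftvm-stats.csv

    # the per-VM CPU counters, including %VMWAIT, %USED and %RDY,
    # appear under the "Group Cpu" columns in the resulting CSV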
I just wondered if this was a byproduct of having FT enabled, and/or if it was anything to worry about.
Are you referring to the primary or the secondary VMs?
Also, are the hosts those VMs are running on under high contention?
Could you send a screenshot of the virtual machine view of esxtop showing this behavior?
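In case it helps, this is the view I mean (interactive esxtop from the ESXi shell):

    esxtop       # starts in the CPU view
    # press V (uppercase) to show only the VM worlds
    # press f to add/remove fields if %VMWAIT is not displayed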
90 would not be bad for %WAIT but could be a problem for %VMWAIT. Duncan Epping has a great post about this over at his Yellow Bricks blog: http://www.yellow-bricks.com/2012/07/17/why-is-wait-so-high/
I cannot find much information about this metric other than:
%VMWAIT: percentage of time a VM was waiting for some VMkernel activity to complete (such as I/O) before it can continue. Includes %SWPWT and "blocked", but not IDLE Time (as %WAIT does).
Possible causes: a storage performance issue, or latency to a device in the VM configuration, e.g. a USB device, serial pass-through device, or parallel pass-through device.
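Given those possible causes, one quick check (just a sketch, assuming shell access; the datastore path and VM name below are examples) is whether the VM has any USB, serial, or parallel pass-through devices configured:

    # list any USB, serial, or parallel devices in the VM's configuration
    grep -iE "usb|serial|parallel" /vmfs/volumes/datastore1/dc01/dc01.vmx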
It could be that this is expected behavior with FT VMs; can anyone else confirm?
Does anyone from VMware ever come in and look at these discussions? I'd like to know the cause of the high %VMWAIT value on an under-used host. I'm also wondering whether having 2 VMs with high %VMWAIT values affects other VMs when their work is scheduled onto the same physical core. We have some slowness on a guest that serves as a small VDI server, with reports of performance lag that wasn't there when this guest was on different, older hardware. My own VDI has run on this same cluster all along, but I don't use it as much as this customer does, and now that we've moved their server to the same 2-host cluster they are complaining about lag. We've ruled out just about everything we can think of.
What's the %RDY value for the VDI server when it's on the host with the 2 FT VMs?
Can you also check network bandwidth on this host, as FT is usually very network-intensive.
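For the network side, esxtop's network view will show per-uplink throughput (which vmkernel port carries FT logging depends on your setup):

    esxtop       # then press n for the network view
    # watch MbTX/s and MbRX/s on the vmkernel port / uplink used for FT logging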
I do not think VMware checks these forums much; you would need to log a support case with them if you have a support agreement.
%RDY is generally under 1. Watching it now, it varies between 0.4 and 0.85.
Network is 10 Gb/s and not much latency there.
VMware support seems to have suffered compared to my experiences from just a few years ago. Almost not even worth the time to open a ticket. We're a VSPP, so yes, we have support.
%RDY less than 4 generally means no noticeable CPU contention, so we can rule that out. The issue could possibly be disk I/O / disk latency; can you send over a screenshot of that metric for this host from the performance charts?
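If you'd rather look at it in esxtop than the charts, the disk views break the latency down (the thresholds here are the commonly cited rules of thumb, not hard limits):

    esxtop       # then press d (adapter), u (device), or v (per-VM disk)
    # DAVG/cmd = device latency, KAVG/cmd = kernel latency, GAVG/cmd = total
    # sustained DAVG above ~20-25 ms, or KAVG above ~2 ms, is usually worth a look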
What's your storage array? vendor / model?
Storage is a VNX5700. Not much going on with this, really. What I captured actually shows more activity than is normally on here.
I opened a ticket with VMware and they just pointed to an outdated storage driver. The funny thing is that this is part of a VCE vBlock and they did a complete upgrade for us just a few months ago.
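For anyone else wanting to verify driver versions after an upgrade, this is roughly how I checked (the grep pattern is just an example; match whatever your HBA vendor ships):

    # list installed VIBs and filter for the storage driver
    esxcli software vib list | grep -i lpfc    # example: Emulex lpfc driver
    # show which driver each storage adapter is actually using
    esxcli storage core adapter list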
Did they say anything about that metric and FT combination? I.e., would that be expected with an FT VM?
I have a ticket open with VCE now. They seem more interested in actually helping figure this out than just telling me to update a storage driver. Having said that, the support person has to go back and ask some senior engineers because he isn't at all sure about the situation. It seems like there is not much knowledge out there about the %VMWAIT value other than what the threshold should be.
Let us know how you get on; as you say, there isn't a lot of documentation regarding this metric and FT.
I worked with a VCE tech today and he said, "The hosts are experiencing heap exhaustion in relation to the vmkStateLogger (FT)." We also have vCPU hot-add turned on for the FT guests, which is not a supported configuration. He had a hard time digging up information himself, but the high %VMWAIT times cannot be attributed to anything they would normally point to, like storage I/O latency. The first course of action he wants us to try is turning off vCPU hot-add for the guests. This will require getting a maintenance window from the customer, so I am not sure how soon this can be done.
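For the record, this is roughly how I plan to confirm both findings (a sketch; the log path is the 5.x default, and I believe the .vmx parameter is vcpu.hotadd, with the edit requiring the VM to be powered off):

    # look for heap-related messages (e.g. from the FT state logger) on the host
    grep -i heap /var/log/vmkernel.log

    # with the VM powered off, disable vCPU hot-add in the .vmx
    # (also doable in the vSphere client under the VM's CPU settings)
    vcpu.hotadd = "FALSE"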
Very interesting, thanks for posting back. It would be great to hear if this solves the issue (I was not aware that CPU hot-add was not supported in FT-enabled VMs, but given how FT works it makes sense that this would be a limitation!).
I may lab this myself and put it up on my blog when I get a second.
When you think about it, it makes sense. In 5.5 you can only have FT enabled with 1 vCPU, so it does no good to have vCPU hot-add enabled. And apparently with vCPU hot-add enabled it does not use NUMA, but instead reverts back to whatever came before NUMA. I can't remember what it's called.
It's just called UMA (Uniform Memory Access), because NUMA is just the "Non-Uniform" version of it.
But I don't think hot-add disabling vNUMA in VMs is relevant with FT. No matter what combination of virtual sockets and cores per socket you set (use sockets only, unless you have a good reason like licensing), vNUMA only kicks in if you assign 9 or more vCPUs to a VM.
Unless you edit some nifty advanced parameters, VMs will always see a UMA topology. Then again, this is only what is happening inside the guest and does not necessarily correlate to how virtual cores are scheduled on the physical topology.
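The advanced parameters I mean are per-VM settings like these (the defaults shown are as I recall them for 5.x; change them only if you know why):

    # .vmx / advanced VM settings controlling vNUMA exposure
    numa.vcpu.min = "9"                  # vNUMA is exposed only at 9+ vCPUs by default
    numa.vcpu.maxPerVirtualNode = "8"    # caps vCPUs per virtual NUMA node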
5.5, sorry. Yes, I was thinking about v6.