Fault tolerance and high %VMWAIT value

shane1973 · ‎07-27-2016

Hello,

We have a customer who required FT for their domain controllers on our platform. As I was troubleshooting another issue I noticed that all 4 of these VMs have %VMWAIT values that hover around 90 and will shoot up over 100. This is ESXi 5.5 so they have a single cpu. The %used value is generally less than 15.

I just wondered if this was a byproduct of having FT enabled, and/or if it was anything to worry about.

virtualg_uk · ‎07-27-2016

Are you referring to the primary or secondary VMs

Also, are the hosts that those VMs are running on, have high contention?

Could you send a screenshot of the virtual machine view of esxtop showing this behavior?

90 would not be bad for %WAIT but could be a problem for %VMWAIT. Duncan Epping has a great post about this over ah his Yellow Bricks blog: http://www.yellow-bricks.com/2012/07/17/why-is-wait-so-high/

Graham | User Moderator | https://virtualg.uk

shane1973 · ‎07-28-2016

No contention issues. Both are secondary - CATXDC02 and AATXDC02

After checking, it is both secondary on each of the 2 hosts that have the high %VMWAIT values

So on the other host it is CATXDC01 and AATXDC01

virtualg_uk · ‎07-28-2016

I cannot find much information about this metric other than

%VMWAIT: percentage of time a VM was waiting for some VMkernel activity to complete (such as I/O) before it can continue. Includes %SWPWT and "blocked", but not IDLE Time (as %WAIT does).
Possible cause: Storage performance issue | latency to a device in the VM configuration eg. USB device, serial pass-through device or parallel pass-through device

It could be that this is expected behavior with FT VMs, can anyone else confirm?

Graham | User Moderator | https://virtualg.uk

shane1973 · ‎07-28-2016

The DAVG and GAVG numbers for the 2 hosts are very low, so it shouldn't be storage i/o related. They're domain controllers and really not doing much. I'm assuming it is a product of FT, but it would be nice to get a confirmation as these numbers seem HUGE.

shane1973 · ‎08-02-2016

Does anyone from VMware ever come in and look at these discussions? I'd like to know the cause of the high %VMWAIT value on an under-used host. I'm also wondering if having 2 VMs with high %VMWAIT values affects other VMs if their work is scheduled onto the same physical core. We have some slowness on a guest that serves as a small VDI server. We have reports of performance lag that wasn't there when this guest was on different, older hardware. I've had it all along on this same cluster, but I don't use my VDI as much as this customer, and now that we've moved their server to the same 2 host cluster they are complaining about lag. We've ruled out just about everything we can think of for the most part.

virtualg_uk · ‎08-02-2016

Whats the %RDY value for the VDI server when on the host with the 2 FT VMs?

Can you check network bandwidth too on this host as FT is usually very network intensive.

I do not think VMware check these forums much, you would need to log a support case with them if you have a support agreement.

Graham | User Moderator | https://virtualg.uk

shane1973 · ‎08-02-2016

%READY is generally under 1. Watching it now it is varying between .4 and .85

Network is 10 Gb/s and not much latency there.

VMware support seems to have suffered compared to my experiences from just a few years ago. Almost not even worth the time to open a ticket. We're a VSPP, so yes we have support.

virtualg_uk · ‎08-03-2016

%RDY less than 4 generally means no noticeable CPU contention so we can rule that out. Issue could possibly be disk IO / disk latancy, can you send over a screenshot of that metric for this host form the performance charts?

What's your storage array? vendor / model?

Graham | User Moderator | https://virtualg.uk

shane1973 · ‎08-04-2016

Storage is a VNX5700. Not much going on with this really. What I capped is more activity than is normally on here actually.

I opened a ticket with VMware and they just pointed to an outdated storage driver. Funny thing is this is part of a VCE vBlock and they just did a complete upgrade for us a few months ago.

virtualg_uk · ‎08-05-2016

Did they say anything about that metric and FT combination? Ie would that be expected with an FT VM?

Graham | User Moderator | https://virtualg.uk

shane1973 · ‎08-09-2016

I have a ticket open with VCE now. They seem more interested in actually helping figure this out versus just telling me to update a storage driver. Having said that, the support person has to go back and ask some senior engineers because he isn't sure at all about the situation. Seems like there is not much knowledge out there about the %VMWAIT value other than they know what the threshold should be.

virtualg_uk · ‎08-09-2016

Let us know how you get on, as you say, there isn't a lot of documentation regarding this metric and FT

Graham | User Moderator | https://virtualg.uk

shane1973 · ‎08-10-2016

I worked with a VCE tech today and he said "The hosts are experiencing heap exhaustion in relation to the vmkStateLogger (FT)." We also have vcpu hotadd turned on for the FT guests which is not a supported configuration. He had a hard time digging up information himself, but the high VMWAIT times cannot be attributed to anything they would normally point to like storage i/o latency. Our first course of action he wants us to try is turning off 'vcpu hotadd' for the guests. This will require us getting a maintenance window from the customer, so I am not sure how soon this can be done.

virtualg_uk · ‎08-10-2016

Very interesting, thanks for posting back. Would be great to hear if this solves the issue (I was not aware that CPU Hot-add was not supported in FT enabled VMs, but it makes sense given how FT works that this would be a limitation !!)

I may lab this myself and put it up on my blog when I get a second.

Graham | User Moderator | https://virtualg.uk

shane1973 · ‎08-11-2016

When you think about it, it makes sense. In 5.5 you can only have FT enabled with 1 vcpu, so it does no good to have vcpu hotadd enabled. And apparently with vcpu hotadd enabled it does not use NUMA, but instead reverts back whatever came before NUMA. I can't remember what it's called.

MKguy · ‎08-11-2016

Just called UMA --, because NUMA is just the "Non-" Version of it

But I don't think Hot-Add disabling vNUMA in VMs is relevant with FT. This is because no matter what number of virtual Sockets and Cores Per Socket you set (use Sockets only unless you have a good reason like licensing), vNUMA only kicks in if you assign 9 or more vCPUs to a VM.

Unless you edit some nifty advanced parameters‌, VMs will always see a UMA topology. And then again this is only what's happening inside the guest and does not necessarily correlate to how virtual cores are scheduled on the physical topology.

-- http://alpacapowered.wordpress.com

virtualg_uk · ‎08-11-2016

5.5. sorry, yes, was thinking about v6

Graham | User Moderator | https://virtualg.uk

All

Fault tolerance and high %VMWAIT value