VMware Cloud Community
John_Martin_RC
Contributor

ESXi 6.0 freezes while performing VIM25 power-on operations

Product

VMware vSphere ESXi 6.0

Product version

6.0 U3 – 5050593

Severity

2 - High

Issue category

Fault/Crash

Issue description

Configuration
I have two ESXi 6.0 servers. Across these two servers there are 6 resource groups containing 36 virtual machines. Each server has 24 cores, and the vCPU assignments on each machine don't exceed 21 vCPUs. The guest OS types are RHEL 6 and Solaris 10. On server 1 there is one additional RHEL 6 guest that runs the 'controller' application, which uses VIM25 to start, stop, and otherwise manage the other instances.

Issue
Using the VIM25 web services, I start the first resource group of 7 servers, followed within 30 seconds by the second resource group of 7; these 14 VMs are located on the first physical server. I then, within 30 seconds, start the third group of 7 on the second physical server. At this point, two issues occur:
1) On the guest running the VIM25 client that issues these startups, the load average grows to over 9 and even hits 11; however, the CPU usage of the Java PID making the vim25 client calls is well under 100% (it bounces between 55 and 60). From my research, I suspect this is caused by vim25 client calls not completing and thus causing contention.
2) The VIM25 client calls to power on guests show up in the event log on server 2, and the status of each event goes to PROCESSING and stays there.
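
One way to probe the first symptom: on Linux, the load average counts tasks in uninterruptible sleep (state D) as well as runnable tasks (state R), so a load of 9 to 11 with modest CPU use usually points at threads that are blocked rather than busy. Below is a minimal sketch for counting thread states on the controller guest. It assumes Python 3 is available there and that /proc is mounted; the function names are illustrative and not part of any VMware tooling.

```python
import os
from collections import Counter


def parse_state(stat_line):
    """Return the one-letter state from a /proc/<pid>/stat line.

    The command name sits in parentheses and may itself contain spaces or
    parentheses, so split on the LAST ')' rather than on whitespace.
    """
    rest = stat_line[stat_line.rfind(")") + 1:].split()
    return rest[0]


def count_task_states(proc_root="/proc"):
    """Count thread states (R, S, D, ...) across all processes."""
    counts = Counter()
    if not os.path.isdir(proc_root):
        return counts  # not a Linux system, nothing to scan
    for pid in os.listdir(proc_root):
        if not pid.isdigit():
            continue
        task_dir = os.path.join(proc_root, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited while we were scanning
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "stat")) as fh:
                    counts[parse_state(fh.read())] += 1
            except (OSError, IndexError):
                continue  # thread exited or line was unreadable
    return counts


if __name__ == "__main__":
    states = count_task_states()
    print("runnable (R):", states["R"])
    print("uninterruptible (D):", states["D"])
```

If the D count climbs while the power-on calls are pending, that would be consistent with the client threads waiting on calls that never complete, though it does not by itself say what they are waiting on.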

At this point, server 2 becomes 'frozen', as does the 'controller' application guest. In some cases the only way to recover is to go to the text console of the server and issue a shutdown with forced VM shutdown.

I need help troubleshooting this issue. A further complication is that these servers are in a secure lab, and I cannot provide any logs from them.

2 Replies
John_Martin_RC
Contributor

Additional Info:

esxtop shows very little resource utilization on the guests themselves. vCPU utilization is only 21% at its worst. If you "throttle" the startup of the guests (i.e., start the first 7 servers, wait 30 seconds, start the second 7, wait 60 seconds, start the third 7, wait 90 seconds for the fourth, etc.), then the ESXi server does not freeze. Without throttling, if you're watching CPU performance from vSphere, there is a massive spike reaching 100%, and then the vSphere client freezes along with the server itself. The only way to recover is to issue a server shutdown from the console and have it kill the running VMs.

What other diagnostics can I look at? How can I find out the state of the web service server? Is there any tuning I can do there? Maybe the I/O contention comes from how many requests I've made via VIM to the web service server?
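
The throttled startup that avoids the freeze (pauses of 30, 60, then 90 seconds between groups) can be sketched as follows. This is only an outline of the scheduling logic: `power_on` is a hypothetical callable standing in for whatever the controller uses to issue the vSphere API `PowerOnVM_Task` call, and the `sleep` parameter exists only so the pauses can be stubbed out when trying the function.

```python
import time


def staggered_start(groups, power_on, base_delay=30, sleep=time.sleep):
    """Power on each group in turn, pausing longer after each one.

    groups     -- list of lists of VM names, in start order
    power_on   -- hypothetical callable taking one VM name
    base_delay -- seconds; the pause after group i is base_delay * (i + 1)
    Returns the list of pauses taken, in seconds.
    """
    pauses = []
    for index, group in enumerate(groups):
        for vm in group:
            power_on(vm)
        if index < len(groups) - 1:  # no pause after the last group
            pause = base_delay * (index + 1)
            sleep(pause)
            pauses.append(pause)
    return pauses


if __name__ == "__main__":
    # Dry run: record the start order and skip the real waiting.
    started = []
    demo_groups = [["vm1", "vm2"], ["vm3"], ["vm4"], ["vm5"]]
    print(staggered_start(demo_groups, started.append, sleep=lambda s: None))
    print(started)
```

A fixed timer is the simplest form of throttling. A variant that waits for each power-on task to report completion before starting the next group would adapt to how long the host actually takes, at the cost of more bookkeeping in the client.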

John_Martin_RC
Contributor

A little follow-up for those following along at home 😉

- Helped the issue some by restructuring the resource pools (flattening out a very nested layout) and tuning them properly. Good resources:

"Where Did I Get Those Numbers?" section:

http://wahlnetwork.com/2012/02/01/understanding-resource-pools-in-vmware-vsphere/

The vSphere Resource Management for ESXi 6.0 document, to validate whether the best practices are applied:

https://docs.vmware.com/en/VMware-vSphere/6.0/vsphere-esxi-vcenter-server-601-resource-management-gu...

However, the issue could still be reliably reproduced.

- Looking at esxtop, I noticed that my Solaris 10 64-bit guest instances were ALWAYS showing 130-150 %USED. So I tried tuning those instances: added some memory, added some vCPUs... still no luck.

- Then I started testing with 3 resource pools of 7 instances each, starting only the RHEL 6.7 instances first and, approximately 60 seconds later, the Solaris instances. That WORKED!

- Then I tried all Solaris instances first, waited ~90 seconds, then started all RHEL instances... THAT WORKED!

- Rinse and repeat... each time NO ISSUE!

- Tried the 'mixed OS startup'... ISSUE RETURNED!

Now trying to determine what the resource contention is within ESXi when doing mixed OS startups!
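
Until the root cause is found, the workaround above (one guest OS at a time, with a pause in between) can be sketched as a small ordering step in the controller. This is a minimal sketch under assumptions: the labels such as "rhel6" and "solaris10" are placeholders for however the controller records the guest type, and the function name is hypothetical.

```python
def waves_by_guest_os(vms, order=None):
    """Group VMs into start waves, one wave per guest OS.

    vms   -- iterable of (vm_name, guest_os) pairs
    order -- optional list of guest OS labels to start first, in that order;
             any OS not listed follows in first-seen order
    Returns a list of (guest_os, [vm_name, ...]) tuples.
    """
    waves = {}
    for name, guest_os in vms:
        waves.setdefault(guest_os, []).append(name)
    preferred = [os_name for os_name in (order or []) if os_name in waves]
    remaining = [os_name for os_name in waves if os_name not in preferred]
    return [(os_name, waves[os_name]) for os_name in preferred + remaining]


if __name__ == "__main__":
    inventory = [("app1", "rhel6"), ("db1", "solaris10"), ("app2", "rhel6")]
    for guest_os, names in waves_by_guest_os(inventory, order=["solaris10"]):
        print(guest_os, names)
```

Each wave would then be powered on as a unit, with a pause of roughly 60 to 90 seconds before the next wave, matching the runs that worked.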
