ckperry
Contributor

Troubleshooting a Windows VM

If you have a typical real-world VI setup with VC and several ESX servers hosting a number of Windows VMs and Linux VMs, how would you troubleshoot this?

A couple of users of one of the Windows VMs mention that the server is running slowly today.

What would be your first instinct to check?

I'm not necessarily looking for a "right" answer here as I am not sure there is just one right answer (unless you know the actual issue magically up front).

Now, the trick is that the issue is a runaway process in the Windows VM that is eating 100% of the VM's CPU and bogging down the server (imagine that on a Windows box).

Thanks for any insight.

Chris

TomHowarth
Leadership

Remember, unless there is an obvious performance issue on more than one guest, it is unlikely to be ESX. VMotion the guest to another host to see if this solves the problem; if not, it is very likely an individual guest issue. Treat the VM as if it were a physical server and troubleshoot as you normally would. If it is a network issue: can you ping from the guest to the gateway, can you ping from the guest to a machine on a different network, if it uses DHCP does it have an IP address, are the network properties correct? For sluggish performance, check whether there are any hung processes, check the event log, etc.

Check the history in VC; this should show whether there were any periods of unusual activity on a host, cluster or datacenter. What time of day is the sluggishness happening? Is it a logon storm? Etc.

Tom Howarth

VMware Communities User Moderator

Tom Howarth VCP / VCAP / vExpert
Blog: http://www.planetvm.net
Contributing author on VMware vSphere and Virtual Infrastructure Security: Securing ESX and the Virtual Environment
Contributing author on VCP VMware Certified Professional on VSphere 4 Study Guide: Exam VCP-410
williamarrata
Expert

Starting with ESX Server 3.5 and VirtualCenter 2.5, VMware DRS applies a cap to the memory overhead of virtual machines to control the growth rate of this memory. This cap is reset to a virtual machine specific computed value after VMotion migrates the virtual machine. Afterwards, if the virtual machine monitor indicates that the virtual machine requires more overhead memory, VMware DRS raises this cap at a controlled rate (1MB per minute, by default) to grant the required memory until the virtual machine overhead memory reaches a steady-state and as long as there are sufficient resources available on the host.

For VirtualCenter 2.5, this cap is not increased to satisfy the virtual machine's steady-state demand as expected. Thus, the virtual machine operates with an overhead memory that is less than its desired size, which in turn may lead to higher observed virtual machine CPU usage and lower virtual machine performance in a VMware DRS-enabled cluster.
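To put the default growth rate quoted above (1MB per minute) in perspective, here is a minimal sketch of the arithmetic. The function name and the example figures (a 100MB cap against a 160MB requirement) are made up for illustration and do not come from VMware; only the 1MB-per-minute default is taken from the text.

```python
import math


def minutes_to_reach(required_mb: float, current_cap_mb: float,
                     rate_mb_per_min: float = 1.0) -> int:
    """Whole minutes until a cap growing at a fixed rate covers the demand.

    rate_mb_per_min defaults to the 1MB per minute quoted in the article;
    the other two arguments are whatever you observe for the VM.
    """
    if rate_mb_per_min <= 0:
        raise ValueError("rate must be positive")
    shortfall = required_mb - current_cap_mb
    if shortfall <= 0:
        return 0  # the cap already covers the requirement
    return math.ceil(shortfall / rate_mb_per_min)


# Hypothetical figures: cap reset to 100MB after VMotion, VM wants 160MB.
print(minutes_to_reach(required_mb=160, current_cap_mb=100))  # 60
```

So even when the cap does grow as designed, a VM that needs tens of MB more overhead can spend a long time under its desired size after a migration.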

Diagnosing the Issue

To diagnose the issue:

  1. Log in to VirtualCenter with Virtual Infrastructure Client as an administrator.

  2. Right-click your cluster from the inventory.

  3. Click Edit Settings.

  4. Disable VMware DRS.

  5. Click OK and wait for 1 minute.

  6. In the Virtual Infrastructure Client, note the virtual machine's CPU usage from the Performance tab and the virtual machine's memory overhead from the Summary tab.

  7. Right-click your cluster from the inventory.

  8. Click Edit Settings.

  9. Re-enable VMware DRS.

  10. Use VMotion to migrate a problematic virtual machine to another host.

  11. Note the virtual machine CPU usage and memory overhead on the new host.

  12. Disable VMware DRS on the cluster again, as noted above, and wait for 1 minute.

  13. Note the virtual machine CPU usage and memory overhead on the new host.

If the CPU usage of the virtual machine increases in step 11 in comparison to step 6, and decreases back to the original state (similar to the behavior in step 6) in step 13 with an observable increase in the overhead memory, this indicates the issue discussed in this article.
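The comparison described above can be written down as a small check once you have the three sets of readings. This is only a sketch: the `Reading` type, the function name and the 10% tolerance used to decide that CPU usage is "back to the original state" are assumptions for illustration, not part of the article.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    cpu_mhz: float       # CPU usage noted from the performance tab
    overhead_mb: float   # memory overhead noted from the summary tab


def matches_overhead_cap_pattern(step6: Reading, step11: Reading,
                                 step13: Reading,
                                 cpu_tolerance: float = 0.10) -> bool:
    """True if the readings follow the pattern the article describes."""
    baseline_limit = step6.cpu_mhz * (1 + cpu_tolerance)
    cpu_rose = step11.cpu_mhz > baseline_limit          # worse after VMotion
    cpu_recovered = step13.cpu_mhz <= baseline_limit    # back near step 6
    overhead_grew = step13.overhead_mb > step11.overhead_mb
    return cpu_rose and cpu_recovered and overhead_grew


# Made-up readings purely to show the call.
print(matches_overhead_cap_pattern(
    Reading(cpu_mhz=500, overhead_mb=100),
    Reading(cpu_mhz=900, overhead_mb=60),
    Reading(cpu_mhz=510, overhead_mb=105),
))  # True
```

If the function returns False, the slowdown probably has another cause and the normal in-guest checks still apply.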

You do not need to disable DRS to work around this issue.

Working around the issue

To work around this issue:

  1. Log in to VirtualCenter with Virtual Infrastructure Client as an administrator.

  2. Right-click your cluster from the inventory.

  3. Click Edit Settings.

  4. Ensure that VMware DRS is shown as enabled. If it is not enabled, check the box to enable VMware DRS.

  5. Click OK.

  6. Click an ESX Server from the Inventory.

  7. Click the Configuration tab.

  8. Click Advanced Settings.

  9. Click the Mem option.

  10. Locate the Mem.VMOverheadGrowthLimit parameter.

  11. Change the value of this parameter to 5 and click OK.

    Note: By default this setting is set to -1.

Verifying the workaround

To verify the setting has taken effect:

  1. Log in to your ESX Server service console as root from either an SSH session or directly from the console of the server.

  2. Type less /var/log/vmkernel.

A successfully changed setting displays a message similar to the following and no further action is required:

vmkernel: 1:16:23:57.956 cpu3:1036)Config: 414: "VMOverheadGrowthLimit" = 5, Old Value: -1, (Status: 0x0)

If changing the setting was unsuccessful a message similar to the following is displayed:

vmkernel: 1:08:05:22.537 cpu2:1036)Config: 414: "VMOverheadGrowthLimit" = 0, Old Value: -1, (Status: 0x0)

Note: If you see a message changing the limit to 5 and then changing it back to -1, the fix is not successfully applied.
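If you would rather not page through the log by eye, the same check can be sketched in a few lines of Python. This assumes the log lines look like the two samples quoted above; the function names are made up, and the rule "the last change must set the value to 5" is my reading of the note, not an official test.

```python
import re

# Matches e.g.: "VMOverheadGrowthLimit" = 5, Old Value: -1
# The opening quote is optional in case it was lost in copy and paste.
PATTERN = re.compile(
    r'"?VMOverheadGrowthLimit"\s*=\s*(-?\d+),\s*Old Value:\s*(-?\d+)'
)


def growth_limit_changes(lines):
    """Return (new_value, old_value) for every matching log line, in order."""
    changes = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            changes.append((int(match.group(1)), int(match.group(2))))
    return changes


def fix_applied(lines, expected=5):
    """True only if the most recent change left the limit at `expected`."""
    changes = growth_limit_changes(lines)
    return bool(changes) and changes[-1][0] == expected


# The two sample lines from the article.
good = ['vmkernel: 1:16:23:57.956 cpu3:1036)Config: 414: '
        '"VMOverheadGrowthLimit" = 5, Old Value: -1, (Status: 0x0)']
bad = ['vmkernel: 1:08:05:22.537 cpu2:1036)Config: 414: '
       '"VMOverheadGrowthLimit" = 0, Old Value: -1, (Status: 0x0)']
print(fix_applied(good), fix_applied(bad))  # True False
```

To run it against a real host you would pass in the lines of a copy of /var/log/vmkernel, for example `fix_applied(open(path))`.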

In the case that the fix is unsuccessful attempt the following:

  1. Create a new cluster and move the ESX Server hosts to this cluster.

  2. Check to see if the fix has been implemented successfully.

To fix multiple ESX Server hosts

If this parameter needs to be changed on several hosts (or if the workaround fails for the individual host) use the following procedure to implement the workaround instead of changing every server individually:

  1. Log on to the VirtualCenter Server Console as an administrator.

  2. Make a backup copy of the vpxd.cfg file (typically it is located in C:\Documents and Settings\All Users\Application Data\VMware\VMware VirtualCenter\vpxd.cfg).

  3. In the vpxd.cfg file, add the following configuration after the <vpxd> tag:

<cluster>

<VMOverheadGrowthLimit>5</VMOverheadGrowthLimit>

</cluster>

This configuration provides an initial growth margin in MB-to-virtual machine overhead memory. You can increase this amount to larger values if doing so further improves virtual machine performance.

  4. Restart the VMware VirtualCenter Server Service.

    Note: When you restart the VMware VirtualCenter Server Service, the new value for the overhead limit should be pushed down to all the clusters in VirtualCenter.
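For anyone who prefers to script the vpxd.cfg edit in step 3 rather than do it by hand, here is a rough sketch using Python's standard xml.etree.ElementTree. It works on the XML text only; the sample document in the example is a hypothetical stand-in, not a real vpxd.cfg, and the function name is made up. ElementTree discards comments and original formatting when it re-serializes, so keep the backup copy from step 2.

```python
import xml.etree.ElementTree as ET


def add_growth_limit(cfg_xml: str, limit_mb: int = 5) -> str:
    """Return cfg_xml with <cluster><VMOverheadGrowthLimit> set under <vpxd>."""
    root = ET.fromstring(cfg_xml)
    vpxd = root if root.tag == "vpxd" else root.find(".//vpxd")
    if vpxd is None:
        raise ValueError("no <vpxd> element found")

    cluster = vpxd.find("cluster")
    if cluster is None:
        # Place it directly after the opening <vpxd> tag, as the step says.
        cluster = ET.Element("cluster")
        vpxd.insert(0, cluster)

    limit = cluster.find("VMOverheadGrowthLimit")
    if limit is None:
        limit = ET.SubElement(cluster, "VMOverheadGrowthLimit")
    limit.text = str(limit_mb)
    return ET.tostring(root, encoding="unicode")


# Hypothetical minimal config, only to show the call.
SAMPLE = "<config><vpxd><other/></vpxd></config>"
print(add_growth_limit(SAMPLE))
```

The function reuses an existing <cluster> element if there is one, so running it twice does not create a duplicate block. You still need to restart the VirtualCenter Server Service afterwards.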

This issue will be addressed in a future VMware VirtualCenter update release. The workarounds will not be needed in the update release and in any subsequent releases of VirtualCenter.

Hope that helped. 🙂
ckperry
Contributor

Thanks for the responses. Anyone else care to share how they would start troubleshooting this issue?

C

weinstein5
Immortal

The first thing is to ensure resources are not being constrained, particularly memory and CPU. Check the performance graph for the VM for CPU Ready time and see if there is ballooning occurring in the VM. I do this particularly when users notice poor performance; you will be able to see if there is a jump in these values that coincides with the start of the poor performance.
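A rough sketch of the "look for a jump" idea, for ready values exported from the performance chart. It assumes the chart reports CPU Ready as a millisecond summation per sample and that the real-time chart samples every 20 seconds; check both against your own chart settings. The function names and the 5% threshold in the example are arbitrary choices for illustration, not a VMware recommendation.

```python
def ready_percent(ready_ms: float, interval_s: float = 20.0) -> float:
    """Convert a CPU Ready summation (ms) into a percentage of the interval."""
    return ready_ms / (interval_s * 1000.0) * 100.0


def first_jump(samples_ms, threshold_pct: float, interval_s: float = 20.0):
    """Index of the first sample whose ready % exceeds the threshold, or None."""
    for index, ready_ms in enumerate(samples_ms):
        if ready_percent(ready_ms, interval_s) > threshold_pct:
            return index
    return None


# Made-up samples: the third one (2500 ms in 20 s = 12.5%) stands out.
samples = [200, 300, 2500, 400]
print(first_jump(samples, threshold_pct=5.0))  # 2
```

Match the index back to the sample timestamp and compare it with the time users say the slowness started.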

ckperry
Contributor

Thanks for the responses. I find it interesting that people who are used to working virtually seem to suggest starting from the macro virtual structure and working in towards an individual machine (VM). I am just now coming from a regular sysadmin position, and if I hear from a user that a Windows server is acting up, I am used to starting with that server and working my way back to the user's workstation or network issues later. Knowing how Windows servers can be, I would not think this mindset should have to change much when moving to a virtualized setup, but I guess there is enough complexity in properly spec'ing and configuring a virtual installation that the outside-in approach is usually more productive. I wonder if, as the virtualization movement matures, things will eventually move to an inside-out mindset again?

C

Ken_Cline
Champion

I'm not sure the philosophy has changed all that much - I think what you're seeing is a result of the forum in which you're posing the question. Folks here are accustomed to helping with VI problems, so that's where they're likely to start, but I think that most would agree that diagnosing a problem within a guest should begin within the guest. I find that, more often than not, performance problems within the guest are caused by the following:

- User perception (PEBCAK)

- Improperly configured application

- Improperly configured guest OS

- A network problem that has nothing to do with virtualization

Once you get through the basics, then you start looking at the VI:

- Default VM setting uses vSMP (all VMs have two or more vCPUs)

- Performing backup / virus scan on all VMs at the same time (host overload)

- Improperly configured networking (speed/duplex mismatch, misconfigured VLAN settings, etc.)

- Improperly configured storage (zoning, "disk full", iSCSI/NFS not set up correctly)

- Runaway log files filling host filesystems

- Under-provisioned VM

Obviously, these are just examples - there are many, many things that can go "bump" in the night...

Ken Cline

Technical Director, Virtualization

Wells Landers

VMware Communities User Moderator

Ken Cline VMware vExpert 2009 VMware Communities User Moderator Blogging at: http://KensVirtualReality.wordpress.com/
ckperry
Contributor

Ken,

Very well said. I did not mean to sound negative in my view; please excuse me if it came across that way. I asked the question to this group because I wanted their opinion on such a problem. As I said, I was not expecting a "right" answer, just some input from different people.

Thanks again everyone. 🙂

C
