bulletproof
Contributor

RHEL 5.3 Guests becoming unresponsive when balloon driver is invoked

A couple of weeks ago, a RHEL guest VM that was only under light load locked up and became unresponsive, requiring a reboot to recover. SSH, HTTP and other services had stopped. After I had been connected to the console for 3-4 minutes I eventually got a login prompt, but never got to the point of being able to issue commands. The reboot fixed things, and we have since rebalanced memory load across our ESX cluster to prevent a recurrence.

Looked at the performance graphing in VIC:

- it was using about 135 MHz of CPU

- zero or negligible I/O

- there was about 2.5-2.7GB of memory ballooning in action, and some swapping

- there was a memory alarm on the physical ESX host, but it still had about 10GB free in physical RAM

Guest OS has 4GB RAM allocated. It is in a resource pool which receives 4000 memory shares out of a total 30000, and currently shares this pool with 5 other VMs on the same physical host. CPU shares are similar, but CPU was in abundance. Actual resource use is ticked as 'unlimited' for CPU/Memory, and shares are equivalent across VMs in the pool.

I have done some research through community threads (such as http://communities.vmware.com/thread/133290) and various memory allocation documents (http://www.vmware.com/files/pdf/large_pg_performance.pdf, http://www.vmware.com/pdf/VI3.5_Performance.pdf, and the Resource Management Guide in particular).

What appears to have happened is that the idle memory tax, in combination with the resource pool shares, was causing this low-activity VM to experience a higher level of ballooning than the default 75% setting we have for the cluster, to the point where it had swapped out some critical OS memory and became unresponsive.
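To make the idle memory tax reasoning concrete, here is a minimal sketch of the adjusted shares-per-page ratio as I understand it from the published description of ESX memory management: ratio = S / (P * (f + k * (1 - f))), with k = 1 / (1 - tax rate). The function name and the example numbers are illustrative assumptions, not anything read from a real host, and the actual VMkernel behaviour may differ in detail.

```python
def adjusted_shares_per_page(shares, pages, active_fraction, tax_rate=0.75):
    """Shares-per-page ratio after idle pages are charged at a higher cost.

    shares          -- memory shares held by the VM
    pages           -- pages currently allocated to the VM
    active_fraction -- estimated fraction of those pages in active use (0..1)
    tax_rate        -- idle memory tax rate (0.75 matches the 75% default)
    """
    if not 0.0 <= active_fraction <= 1.0:
        raise ValueError("active_fraction must be between 0 and 1")
    if not 0.0 <= tax_rate < 1.0:
        raise ValueError("tax_rate must be at least 0 and below 1")
    # An idle page costs k times as much as an active page.
    idle_cost = 1.0 / (1.0 - tax_rate)
    weighted_pages = pages * (active_fraction + idle_cost * (1.0 - active_fraction))
    return shares / weighted_pages


# Hypothetical example: two VMs with equal shares and equal allocations,
# one mostly busy and one mostly idle.
busy = adjusted_shares_per_page(1000, 1000, active_fraction=0.9)
idle = adjusted_shares_per_page(1000, 1000, active_fraction=0.1)
print(f"busy VM ratio: {busy:.3f}, idle VM ratio: {idle:.3f}")
```

Under memory pressure, the VM with the lowest adjusted ratio is the one memory is reclaimed from first, so in this model a mostly idle guest is the preferred ballooning target even when its nominal shares match its neighbours'.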

Is it possible that a RHEL / Linux VM can be so inactive that it reaches an inoperable equilibrium state for ballooning vs swap and services drop out entirely?

8 Replies
weinstein5
Immortal

If ballooning is active, it indicates that there are not enough memory resources to satisfy the needs of the VMs in the resource pool. What are the memory settings for the VMs? Are the shares set to Normal for the VMs? Did you check the other VMs to see if they are ballooning? What are the memory settings for the other VMs? What is the limit setting for the resource pool? Is it set to unlimited? How much memory is in your host?

If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful

bulletproof
Contributor

Thanks weinstein5.

Yep - ballooning was happening across all VMs... some small performance degradation was apparent but not the crippling unresponsiveness that occurred in this RHEL guest.

The other VMs are all likewise set to 'unlimited', as is the resource pool (our resource management is all done via shares, rather than limits/reservations). All 6 VMs in this resource pool are set to 'low' memory shares. The physical host has 52GB of RAM, and there are 47 VMs running on it. It was using 42GB out of 52GB at the time.

weinstein5
Immortal

Do all the VMs have the same memory settings - 4 GB? How many other resource pools do you have and what are their shares set to?

If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful

bulletproof
Contributor

Yes - all other VMs in the pool have 4GB allocated.

The resource pools are set up as follows: 8000, 6000, 6000, 4000, 4000, 2000. This pool is one of the 4000-share pools, out of a total of 30000 shares.
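As a rough worked example of what those shares could mean under full contention, the sketch below splits the host's 52GB proportionally across the six pools and then across six equal-share VMs in one of the 4000-share pools. This is a simplification: shares only come into play when memory is actually contended, and the calculation ignores reservations, limits, virtualization overhead and the idle memory tax. The pool and VM names are placeholders.

```python
def proportional_split(total, shares):
    """Divide `total` among the keys of `shares` in proportion to their share values."""
    total_shares = sum(shares.values())
    if total_shares <= 0:
        raise ValueError("total shares must be positive")
    return {name: total * value / total_shares for name, value in shares.items()}


# Share values from this thread; the pool names are made up for illustration.
pool_shares = {
    "pool_a": 8000,
    "pool_b": 6000,
    "pool_c": 6000,
    "pool_d": 4000,  # the pool containing the affected RHEL guest
    "pool_e": 4000,
    "pool_f": 2000,
}
host_ram_gb = 52.0

pools = proportional_split(host_ram_gb, pool_shares)

# Six VMs in the pool, all with the same ('low') memory shares.
vm_shares = {f"vm{i}": 1 for i in range(1, 7)}
vms = proportional_split(pools["pool_d"], vm_shares)

print(f"pool_d worst-case share of host RAM: {pools['pool_d']:.2f} GB")
print(f"per-VM worst-case share within pool_d: {vms['vm1']:.2f} GB")
```

If the worst-case per-VM figure comes out well below the 4GB each guest is configured with, that would be consistent with heavy ballooning once the host decides it is under pressure, although it does not by itself explain why reclamation started with free RAM still available.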

ac57846
Hot Shot

Remember that without a reservation there is no guarantee that a particular machine will get any physical RAM.

Best practice is to set a reservation on each VM that will provide adequate performance; this would prevent the VMkernel from ballooning away too much memory from the VM.

I am surprised that you had so much ballooning when there was still plenty of free RAM, but reservations are the correct tool to prevent the issue you experienced.

bulletproof
Contributor

You're absolutely right.

I suppose my query is really about whether this happens more often in low-activity Linux guests, particularly RHEL. Although the Windows VMs can suffer performance losses with the setup we are using, we have not seen them become entirely unresponsive or unrecoverably slow.

aalvm
Contributor

Were you guys able to resolve this issue?

I am having the same problem with our RHEL guests.

I do not have any resource pools set up and I have more than ample memory/CPU available to the guest. No memory ballooning is occurring.

When the guest becomes unresponsive, the CPU idles below 100 MHz. The problem is instantly resolved by migrating the VM to another server (no reboot required).

The problem appears to be totally random.

Any help would be great.

aalvm
Contributor

It might be interesting to note that my VMware environment uses AMD CPUs.
