HA - what happens when a failover host is memory constrained?

KenDedesko · ‎11-13-2018

My client has a ridiculous setup. I can't change it.

3-node cluster on 5.5. Each node has 128 GB of RAM. Each node has a large VM on it with 118 GB (we convinced the client to reduce vRAM to 92% of physical RAM, down from 100%... like I said, ridiculous).

HA is enabled. No memory or CPU reservations. Admission Control is disabled.

Slot size is 900 MB for memory and 256 MHz for CPU... again, based on no reservations. 170 failover slots.

Question: if a host fails, the large VM will restart on a surviving host. What happens then?

The target host has 128 GB of RAM and an existing VM of 118 GB. Does unused memory on the target host get swapped out to free up memory for the incoming VM with its 118 GB? How severely impacted is the VM that is being swapped out?

Thanks.

ThompsG · ‎11-13-2018

Hi there and welcome to the community!!!

The standard memory reclamation process would kick in on the ESXi host, as at this point memory would be under contention. This would mean ballooning, compression and swapping.

Most likely, due to time constraints, the ESXi host would drop straight to swapping and then backfill with the other techniques to free up physical memory as required. You can also take a look at the running VMs at the moment and their Consumed memory. It is possible that even though 128 GB has been allocated to the VMs, they are not actually using the physical blocks of memory. For example: 128 GB assigned but only 64 GB being consumed on the ESXi host.
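To put rough numbers on the configured-versus-consumed point, here is a minimal Python sketch using the figures from this thread (128 GB hosts, 118 GB VMs). The function names are hypothetical, the 64 GB consumed figure is only an illustration, and the sketch ignores hypervisor and per-VM overhead memory, so treat it as back-of-the-envelope arithmetic rather than what ESXi actually computes.

```python
def overcommit_ratio(host_gb, vm_configured_gb):
    """Configured VM memory divided by physical host memory."""
    return sum(vm_configured_gb) / host_gb


def reclaim_needed_gb(host_gb, vm_consumed_gb):
    """Memory the host would have to reclaim (balloon, compress, swap).

    Assumption: ignores hypervisor and per-VM overhead memory.
    """
    return max(0.0, sum(vm_consumed_gb) - host_gb)


HOST_GB = 128

# After a failover, one surviving host carries two 118 GB VMs.
ratio = overcommit_ratio(HOST_GB, [118, 118])

# Worst case: both guests touch all of their configured memory.
worst_case = reclaim_needed_gb(HOST_GB, [118, 118])

# Illustrative case: each VM only consumes 64 GB on the host.
partial_case = reclaim_needed_gb(HOST_GB, [64, 64])

print(f"overcommit ratio: {ratio:.2f}")
print(f"worst-case reclaim: {worst_case} GB")
print(f"reclaim at 64 GB consumed each: {partial_case} GB")
```

The gap between the worst case and the partial case is why checking Consumed memory on the running VMs is worthwhile before assuming the worst.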

Impact on the running VM: this will depend on the speed of the disk subsystem, but it will not be pretty. If you have ever started paging out on your desktop then you will know this is not a good place to be. With VMware swapping it will be even worse because it's not the guest OS choosing which blocks of memory to page out but the hypervisor. This could mean that memory you will need again soon is swapped out to disk, which creates wait times while the blocks are swapped back in.

They also need to ensure that the guest OSes have large enough paging files to satisfy the ballooning that might be required. Failure to do this could result in the running VM failing during an HA event due to lack of paging file space within the OS. Take a look at this for what I mean: Swap Space and Memory Overcommitment

Here is another good article that shows the impact and process: VMware ESX Memory Resource Management: Swap - VMware Technical Journal

I would still be trying to convince them that this is a bad idea, especially as a failure will lead to unpredictable results in both performance and recovery.

Kind regards.

KenDedesko · ‎11-13-2018

Thank you. I thought the usual memory reclamation techniques would kick in.

The active memory on these VMs is only about 16 GB... So I would imagine that on restarting the VM there would be a larger initial hit as the recovery host tries to reclaim as much memory as possible for the incoming VM, and then things would settle a bit... Still a terrible design, I know.

Thank you

Ken

ThompsG · ‎11-13-2018

Yes, depending on the guest OS it could be bad (Windows tends to grab all the allocated memory on boot and then release it) or not even noticeable.

Funny that after all these years we still cannot convince people to right-size VMs.

IRIX201110141 · ‎11-13-2018

Yes, the VM will be started when HA kicks in because you have Admission Control disabled.

The memory savings within ESXi will kick in slowly, but HA will be lightning fast because it just needs to assign the VM to a host and power it on. During power-on, most likely all memory will be accessed by the guest OS, which means your host runs out of memory and so the last line of defence comes into play: the existing VM swap file.

Using swap is slooooow. There will be a difference between IP-based external storage and a local SSD configured as the VM swap device.
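As a rough illustration of why the swap device matters, the sketch below estimates how long it would take just to move the worst-case swapped memory at a given sustained throughput. The 108 GB figure is simply 2 x 118 GB minus the 128 GB host from this thread; the two throughput values and the function name are hypothetical placeholders, not measurements of any real storage. Real swap-in is random access, so actual times would likely be worse than this.

```python
def swap_transfer_seconds(swapped_gb, throughput_mb_s):
    """Time to move swapped_gb of memory at a sustained throughput.

    Assumption: purely sequential transfer, no contention.
    """
    return swapped_gb * 1024 / throughput_mb_s


SWAPPED_GB = 2 * 118 - 128  # worst case from the thread's numbers

# Hypothetical throughputs, for comparison only.
external_s = swap_transfer_seconds(SWAPPED_GB, 100)   # e.g. slower IP-based storage
local_ssd_s = swap_transfer_seconds(SWAPPED_GB, 500)  # e.g. local SSD swap device

print(f"external storage: {external_s / 60:.1f} min")
print(f"local SSD:        {local_ssd_s / 60:.1f} min")
```

Plug in measured throughput for your own datastores to get a more meaningful comparison.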

It will affect your existing running VM and also the one restarted by HA. I have seen VMs with compressed and swapped vMEM become unusable because of the slowness. Our 128 GB SQL Server VM spent 30 min shutting down, then we flipped the power switch to kill it, because of some compressed memory.

I have to say sorry... but HA means that you keep some resources spare, and single monster VMs make that much harder.

Regards,

Joerg

KenDedesko · ‎11-14-2018

This all makes sense.... Thank you very much.

I just shake my head at the way this is set up.

Thanks again

Ken
