VMware Cloud Community
dave_pierce
Contributor

Fault Tolerance, 2 Host Cluster, ESX 6, Out of Memory?

I have two identical hosts (Dell R630s with 192 GB RAM and 48 logical CPUs each).

Both hosts run the same build of ESX 6 (3380124) and are in the same cluster, managed by vCenter 6 (2656760).

The vCenter Server (4 cores, 8 GB RAM) is a VM in the cluster, and there are 26 identical Windows VMs (2 cores, 4 GB RAM each) that our QA department uses for automated testing.

  • So a single host has more than enough capacity to run every VM in the cluster, even though CPU is slightly oversubscribed.
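The sizing claim can be checked with simple arithmetic. This Python sketch only totals the configured vCPUs and memory from the figures above; it ignores per-VM virtualization overhead and the hypervisor's own memory, so the result is a lower bound on what one host has to provide.

```python
# Rough single-host sizing check using the figures from the post.
HOST_LOGICAL_CPUS = 48
HOST_MEM_GB = 192

# (count, vCPUs each, memory GB each)
workloads = [
    (26, 2, 4),  # identical Windows QA VMs
    (1, 4, 8),   # vCenter Server VM
]

total_vcpus = sum(n * cpu for n, cpu, _ in workloads)
total_mem_gb = sum(n * mem for n, _, mem in workloads)

# Overhead is not included, so these are lower bounds.
print(f"vCPUs: {total_vcpus} on {HOST_LOGICAL_CPUS} logical CPUs "
      f"({total_vcpus / HOST_LOGICAL_CPUS:.2f}:1)")
print(f"Configured memory: {total_mem_gb} GB of {HOST_MEM_GB} GB")
```

This gives 56 vCPUs on 48 logical CPUs (about 1.17:1) and 112 GB of configured memory out of 192 GB.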

The two hosts are connected directly to each other through a pair of 10GbE NICs dedicated to FT traffic.

The two hosts are connected to a SAN via 10GbE iSCSI. Two LUNs are presented to the hosts for storage of VMs.

Datastore heartbeating is enabled for both LUNs. The primary VM images are on one, and the other is used for FT storage.

EVC Mode is disabled.

We want to enable Fault Tolerance on the VMs so that a single host failure doesn't stop automated testing. We thought this would be a relatively simple matter of having the second host duplicate everything running on the first host.

I have set the following advanced options in the cluster configuration:

  • das.maxftvcpusperhost=80
  • das.maxftvmsperhost=40
  • das.slotMemInMB=1280
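As a sanity check on the two FT limits above, the planned FT footprint per host can be compared against them. This is a sketch under two assumptions: that `das.maxftvmsperhost` and `das.maxftvcpusperhost` count both primary and secondary VMs on a host, and that in a two-host cluster each host therefore carries one copy (primary or secondary) of every FT-protected VM.

```python
# Compare the planned FT footprint on each host against the das.* limits set above.
# Assumption: the limits count primaries and secondaries alike, and with two hosts
# each host holds one copy (primary or secondary) of every FT VM.
DAS_MAX_FT_VCPUS_PER_HOST = 80
DAS_MAX_FT_VMS_PER_HOST = 40

ft_vm_vcpus = [2] * 26 + [4]  # 26 QA VMs + the vCenter Server VM

ft_vms_per_host = len(ft_vm_vcpus)
ft_vcpus_per_host = sum(ft_vm_vcpus)

print(f"FT VMs per host:   {ft_vms_per_host} (limit {DAS_MAX_FT_VMS_PER_HOST})")
print(f"FT vCPUs per host: {ft_vcpus_per_host} (limit {DAS_MAX_FT_VCPUS_PER_HOST})")
```

With all 27 VMs protected, each host would carry 27 FT VMs and 56 FT vCPUs, both below the configured limits, so these two settings would not by themselves explain the failures.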

I have tried a few different slot sizes between 128 and 2048 MB, but it hasn't made a difference.

I can successfully enable fault tolerance on any 25 of the 27 VMs in the cluster. However, the remaining two fail while the secondary VM is being spun up, with:

Failed to prepare the virtual machine for enabling Fault Tolerance: Out of memory

Failed to start migration due to event callback failure.

This shows up in the Tasks view as an error on the secondary host.

I've gone through the configuration line by line, and the hosts are identical. They aren't even close to being out of memory: at the moment there are 54 GB free on the primary with all the VMs running, and 73 GB free on the secondary.

Traffic on the FT NICs is running at around 300 MB/sec, well under the 10GbE limit.
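To put the observed FT traffic in perspective, here is a quick conversion to a fraction of line rate. It assumes "MB/sec" means 10^6 bytes per second and that 10GbE is 10^10 bit/s, and it ignores framing and protocol overhead. If the figure was actually megabits, utilization would be far lower still.

```python
# Convert the observed FT traffic rate to a fraction of one 10GbE link.
# Assumptions: "MB/sec" = 10**6 bytes/s; 10GbE = 10**10 bit/s; no protocol overhead.
observed_bytes_per_s = 300 * 10**6
link_bits_per_s = 10 * 10**9

observed_bits_per_s = observed_bytes_per_s * 8
utilization = observed_bits_per_s / link_bits_per_s

print(f"{observed_bits_per_s / 1e9:.1f} Gbit/s, {utilization:.0%} of one 10GbE link")
```

That works out to about 2.4 Gbit/s, roughly a quarter of a single link.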

Any thoughts? I'm kind of confused here.

Thanks!

virtualg_uk
Leadership

Could you add up the total memory configured for all the VMs that currently have FT enabled and report back?

FT reserves memory so that it cannot be overcommitted on the secondary host. This might explain why you are getting an out-of-memory error: even if actual memory usage on the destination host is low, there may be no reservable memory capacity left.

When Fault Tolerance is turned on, vCenter Server unsets the virtual machine's memory limit and sets the memory reservation to the memory size of the virtual machine.

While Fault Tolerance remains turned on, you cannot change the memory reservation, size, limit, or shares.

When Fault Tolerance is turned off, any parameters that were changed are not reverted to their original values.

vSphere Documentation Center
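To illustrate the difference between free memory and reservable memory, here is a minimal model of reservation-based admission. It assumes, per the documentation quoted above, that each FT VM reserves its full configured memory, plus a per-VM overhead whose real value is unknown here. All numbers in the example are hypothetical; for the hosts in this thread, the inputs would be the secondary host's actual reservable capacity and the reservations already held on it.

```python
# Minimal model of reservation-based admission for one more FT VM.
# Assumption: each FT VM reserves its full configured memory plus some overhead.
# Active memory usage is deliberately NOT an input: this model only looks at
# what is reserved, not what is consumed.

def unreserved_mb(reservable_mb: int, reservations_mb: list[int]) -> int:
    """Reservable capacity left after existing reservations."""
    return reservable_mb - sum(reservations_mb)


def ft_admission_ok(reservable_mb: int, reservations_mb: list[int],
                    vm_mem_mb: int, overhead_mb: int) -> bool:
    """True if a new FT VM's full reservation (memory + overhead) still fits."""
    return unreserved_mb(reservable_mb, reservations_mb) >= vm_mem_mb + overhead_mb


# Hypothetical illustration only: a host with 16 GB reservable and three 4 GB
# FT VMs, each assumed to need 200 MB of overhead.
RESERVABLE_MB = 16 * 1024
OVERHEAD_MB = 200
existing = [4096 + OVERHEAD_MB] * 3

fits = ft_admission_ok(RESERVABLE_MB, existing, 4096, OVERHEAD_MB)
print(f"Unreserved: {unreserved_mb(RESERVABLE_MB, existing)} MB, admit 4th VM: {fits}")
```

In this toy case only 3496 MB remain unreserved, so a fourth 4 GB FT VM is refused, regardless of how little memory the running VMs are actually using.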


Graham | User Moderator | https://virtualg.uk
dave_pierce
Contributor

100 GB (25 × 4 GB) at the moment.

After digging around and taking into account slot size, the number of slots per VM, etc., I estimate that the 25 running VMs shouldn't be using more than 160 GB of RAM, which leaves another 30 GB or so free on the secondary.
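Slot rounding is one reason the memory set aside can exceed the 100 GB of configured memory. The sketch below shows only that rounding. It assumes that a VM whose reservation exceeds the slot size consumes ceil(reservation / slot size) slots, and it ignores per-VM overhead, so its totals are lower bounds rather than a reproduction of the 160 GB estimate above.

```python
import math


# Assumption: a VM whose memory reservation exceeds the slot size consumes
# ceil(reservation / slot size) slots. Per-VM overhead is ignored.
def slots_for_vm(vm_mem_mb: int, slot_mem_mb: int) -> int:
    return math.ceil(vm_mem_mb / slot_mem_mb)


def slot_memory_gb(n_vms: int, vm_mem_mb: int, slot_mem_mb: int) -> float:
    """Memory set aside for n identical VMs once rounded up to whole slots."""
    total_slots = n_vms * slots_for_vm(vm_mem_mb, slot_mem_mb)
    return total_slots * slot_mem_mb / 1024


# 25 FT-enabled VMs of 4 GB each, at the ends of the range tried and the current setting.
for slot_mb in (128, 1280, 2048):
    print(f"slot {slot_mb:>4} MB -> {slot_memory_gb(25, 4096, slot_mb):.1f} GB")
```

Under these assumptions, 128 MB and 2048 MB slots divide 4096 MB evenly (100 GB total). A 1280 MB slot rounds each VM up to 4 slots (5120 MB), for 125 GB total.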
