VMware Cloud Community
juliuslerm
Contributor

ESXi 5 and 6 come to a halt when starting too many VMs simultaneously

I've been managing ESXi 5.5 and 6.0 for a while now, supporting hundreds of VMs for development purposes.

We are developing tools to automate installation of big data solutions, so each developer needs about 20 to 30 VMs.

The workload executed on each VM is pretty light, consisting basically of installing products, open source components, and services, and starting/shutting down those components and services.

The problem I have experienced on pretty sizeable physical servers is that installing brand-new OS instances and booting those images up is very expensive.

I've been dealing with servers that consist of 1 to 3TB RAM, with 48 to 96 cores, plus 22 disks each.

Some of the systems have 22x1.2TB SSD in RAID-5.

Others have the 22 disks as 8TB HDDs each, split into RAID-5 and RAID-6 volumes.

In all systems, the ESXi OS is placed on 2xSSDs in RAID-1.

These are pretty large machines, so memory, processor and storage should not be a problem.

Due to the nature of the usage for these systems (again, primary use is software installs and start/shutdown of VMs), I've been creating around 300 VMs on each of those servers.

Once all VMs are started, the system performs OK, no complaints.

However, it's the VM starts that bring those ESXi servers to their knees.

I need to add a "Start-Sleep 5" in between VM starts (Start-VM), so the start-ups are spaced out over time.

If I do that, then they all come up, but it takes time.
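
For reference, this is roughly what my staggered start loop looks like. A minimal sketch, assuming an existing Connect-VIServer session; the "dev-*" naming pattern is just a placeholder, not my real naming:

# Start powered-off VMs one at a time, pausing between starts to spread the boot storm out
Get-VM -Name "dev-*" |
    Where-Object { $_.PowerState -eq "PoweredOff" } |
    ForEach-Object {
        Start-VM -VM $_ -Confirm:$false | Out-Null
        Start-Sleep -Seconds 5   # spacing between starts; tune for your storage
    }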

Yesterday I made the mistake of trying to start several VMs with Start-VM without sleeps in between.
That brought the entire ESXi host to a halt, including the ESXi console.

I wonder if anyone has anything to say on how to tune ESXi hosts so that too many concurrent VM starts don't bring them to a halt.


I understand things could go slower, but I'm talking about those large servers becoming completely frozen, console and all.

I had to do a power reset, and in the process the VMs got corrupted; I spent a long time fixing *.vmx files and such, and re-installing the OS on those VMs.
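
In case it helps anyone recovering from a similar hard reset: a VM whose .vmx is still intact can be re-registered from the datastore instead of being rebuilt. A rough PowerCLI sketch, where the host name and datastore path are placeholders:

# Re-register an existing .vmx that dropped out of inventory (placeholder names/paths)
$esxHost = Get-VMHost -Name "esx01.example.local"
New-VM -VMFilePath "[datastore1] centos-dev-01/centos-dev-01.vmx" -VMHost $esxHost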


Why are OS installs and VM starts/shutdowns so expensive in ESXi?


Is there anything I can do to avoid those freezing problems?

Just to make it clear: I did try all-SSD storage, and it seems to alleviate the problem a bit, but it is still very concerning that someone starting too many VMs at once can freeze the entire system and corrupt VMs.

Thanks,

Julius

4 Replies
peetz
Leadership

Hi Julius,

With boot storms, disk I/O performance is crucial and the most limiting factor. RAID 5 and 6 are not the best choices for performance (especially write performance), even when handled by hardware RAID controllers, and even when using SSDs.

To mitigate the effects of boot storms you definitely need to investigate (and possibly invest) in this area.

What RAID controllers are you using? What cache sizes do they have? Do they have BBWC (battery-backed write cache, a must with virtualization)? What is the read/write cache ratio?
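
As a quick way to quantify how hard a boot storm hits the storage, you could sample the host's overall disk latency while starting the VMs. A rough PowerCLI sketch (the host name is a placeholder):

# Sample the host-wide disk latency counter in realtime during a boot storm
$esxHost = Get-VMHost -Name "esx01.example.local"
Get-Stat -Entity $esxHost -Stat "disk.maxTotalLatency.latest" -Realtime -MaxSamples 30 |
    Select-Object Timestamp, Value, Unit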

- Andreas

Twitter: @VFrontDe, @ESXiPatches | https://esxi-patches.v-front.de | https://vibsdepot.v-front.de
juliuslerm
Contributor

Thanks for your reply.

I believe I found the cause (at least down to some level): I was migrating a set of Windows VMs that were defined with the SCSI type "LSI Logic SAS" over to a new ESXi 6 server.

Their original ESXi 5.5 server was running OK, since it has a different RAID controller (Adaptec, as opposed to LSI in the newer servers).

I had already realized this causes serious problems when I deployed a couple of other servers and took note of it.

The solution is to rely solely on Paravirtual SCSI type.

But in this migration I simply overlooked the legacy VMs, and the mere presence of a few (10) Windows VMs wreaked havoc on the entire server.

The vast majority of the VMs are CentOS 6 and 7, created with Paravirtual, but even then their start/shutdown suffered horribly, even after the Windows VMs had started up.

I recreated the Windows VMs with Paravirtual SCSI controller type and now everything is OK.
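
For anyone hitting the same issue, this is roughly how the remaining LSI Logic SAS VMs can be found and switched over. A sketch assuming an existing PowerCLI session; the VM name below is a placeholder:

# List VMs that still have an LSI Logic SAS virtual SCSI controller
Get-VM | Where-Object {
    $_ | Get-ScsiController | Where-Object { $_.Type -eq "VirtualLsiLogicSAS" }
} | Select-Object Name

# Switching an existing controller to Paravirtual (the guest needs the pvscsi
# driver from VMware Tools installed before it can boot from it):
# Get-VM "win-dev-01" | Get-ScsiController | Set-ScsiController -Type ParaVirtual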

I can reduce the interval between VM starts to only 2 seconds, and the highest latency I observe is under 10 ms.

I was even able to start all Windows VMs at once from the UI; latency went up to about 40 ms, but the server never got anywhere close to suffering as a whole.

Before recreating them, this exact operation brought the entire ESXi server to a freeze.

In terms of the storage configuration, I definitely understand RAID 5 and 6 are not the most efficient.

I set it up that way to reduce the risk of failures for the several developers.

With the updated VMs, I can see a spike at the beginning of the simultaneous Windows starts that goes up to about 120 MB/s, then levels off at about 20-30 MB/s for the rest of the Windows VM start-ups.

All of that with a maximum latency of about 40 ms.

With LSI Logic SAS, not starting the Windows VMs simultaneously but at 6-second intervals, the latency while starting up the CentOS VMs would go up into the several hundreds of milliseconds.

And when it came to the point of starting up the Windows VMs, the server would crash.

So I can't technically explain why VMware's LSI SCSI controller types don't work well with those types of servers.

I did try to follow up with Softlayer tech support, and all they said was to use Paravirtual and use SSDs.

But based on my own experience, I can categorically say that the LSI Logic SAS controller should be avoided at all costs and that using Paravirtual is a must.

peetz
Leadership

Thanks for your detailed answer. Very interesting lesson learnt!

The pvscsi controller significantly reduces the compute overhead of virtual SCSI emulation, but I did not expect this to make such a difference.

If I understand correctly, this was originally triggered by switching from Adaptec to LSI RAID controllers (in the physical hosts), but at the same time you switched from ESXi 5.5 to 6.0, so that could also be the reason. Correct?

Twitter: @VFrontDe, @ESXiPatches | https://esxi-patches.v-front.de | https://vibsdepot.v-front.de
juliuslerm
Contributor

I actually experienced the exact same disastrous result when keeping the VMware LSI SCSI controllers while moving between servers that both ran ESXi 5.5 (the original one with an Adaptec RAID controller and the target server with LSI).

So I'm certain the problem does not involve the upgrade of ESXi versions.
