We run 3 ESXi hosts as a DRS cluster with 14 VMs spread across them. These are a mix of busy web and database servers. The hosts are 2 Dell 1950s and 1 Dell 2950, each with 2 x quad-core CPUs (3 GHz Xeon X5450) and 32 GB RAM. The servers have Emulex fibre channel cards in them and run against EMC shared storage. The VMs are configured with a mix of 2/4 vCPUs and 8 GB RAM each.
When we put one of the hosts into maintenance mode and migrate the VMs off, we start to run into problems. The VMs run a bit slow after the initial migration, which I expect while memory ballooning kicks in. The problem is that some of the sites running on the VMs time out and do not come back up! We then go through a period of random sites going offline/timing out, and there doesn't seem to be a point where things settle down.
We are not running a lot of VMs. The only obvious thing I can think of is that the VMs are over-specced. At the moment we would really struggle if one of the host machines suffered a hardware failure.
Does anyone have any ideas or thoughts on the issue?
I think you are on the right lines regarding the 'over-specced' VMs.
The factors to look at are resource allocation for the VMs on those hosts. Are the reservations too high? Have you set any limits? If you believe that your reservations and limits are ok then you could consider utilising shares (through resource pools) to manage contention on your remaining two hosts during an outage or maintenance.
I had a similar issue with HA on ESX 3.5: because one machine had been given a massive memory reservation, the 'slot size' HA calculated was huge. This meant that the remaining hosts could only power on 4 of these theoretical machines during an outage.
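To make the slot-size point concrete, here is a simplified sketch of how ESX 3.5 HA derives a slot from the largest per-VM reservations. The VM counts and reservation figures below are illustrative assumptions, not taken from the poster's cluster:

```python
# Simplified model of ESX 3.5 HA admission control: the "slot" is sized
# by the largest CPU and memory reservation across all VMs, and each
# host can only hold as many slots as its tightest resource allows.

def slot_size(vms):
    """Slot = largest CPU and memory reservation across all VMs."""
    return (max(v["cpu_mhz"] for v in vms),
            max(v["mem_mb"] for v in vms))

def slots_per_host(host_mhz, host_mb, slot):
    cpu_slots = host_mhz // slot[0]
    mem_slots = host_mb // slot[1]
    return min(cpu_slots, mem_slots)   # the tighter resource wins

# One VM with a massive 8 GB memory reservation drags the slot up:
vms = [{"cpu_mhz": 500, "mem_mb": 8192}] + \
      [{"cpu_mhz": 500, "mem_mb": 512} for _ in range(13)]

slot = slot_size(vms)                      # (500, 8192)
print(slots_per_host(24000, 32768, slot))  # 32768 // 8192 = 4 slots
```

This is why one outsized reservation can leave a 32 GB host able to power on only 4 VMs' worth of slots, exactly as described above.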
Hope this helps.
Hi, it's not just memory that's causing the degradation in performance; by allocating 2-4 vCPUs per VM you're going to run into processor scheduling problems. The way ESX works, a VM with 4 vCPUs is only allocated processor time when 4 physical cores are available at once. If your box has 2 quad-core CPUs then you've got 8 physical cores; now imagine 4 VMs running on that box, each with 4 vCPUs - each VM can only be scheduled when 4 cores are free at the same time, which gives you a lot of processor contention. I think you can see where this is going..
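The contention described above can be sketched as a toy model. This assumes the strict co-scheduling behaviour the post describes (a VM runs only when one free core per vCPU is available simultaneously); later ESX versions relax this somewhat:

```python
# Toy model of strict co-scheduling: a VM gets CPU time only when it
# can grab one free physical core per vCPU, all at once.

def schedulable(free_cores, vm_vcpus):
    """Greedily pick which VMs fit into one scheduling slot."""
    running = []
    for vcpus in sorted(vm_vcpus, reverse=True):
        if vcpus <= free_cores:
            running.append(vcpus)
            free_cores -= vcpus
    return running

# 8 physical cores, four 4-vCPU VMs: only two can run at a time,
# so the other two sit accruing CPU Ready time.
print(schedulable(8, [4, 4, 4, 4]))   # [4, 4]

# The same host full of 1-vCPU VMs has no such contention:
print(schedulable(8, [1] * 8))        # [1, 1, 1, 1, 1, 1, 1, 1]
```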
Thanks for the advice. I agree it is probably down to the VMs being over-specced!! There is possibly one VM that I can drop down to 2 vCPUs (once I move a couple of sites off!!)
I have some random per-VM reservations that I have taken out. I am going to look at applying the reservations at the resource pool level.
Does anyone have any tips on how I could review the amount of memory the VMs have? I was reading about the vKernel WasteFinder appliance but I cannot see anything on the vKernel site. All that I can find is this YouTube vid:
Not sure it's any good for managing / reclaiming VM memory and CPU usage though. It's mainly aimed at cleaning up uncommitted snapshots, orphaned VMs, etc.
You can obviously see the resource allocation for VMs in the VI client, but if you want an overall view of the environment you could use PowerCLI / PowerGUI.
Another good document is the Resource Management guide (attached)
Have a look and see what you think!
Thanks for the advice so far.
I am just going through each of the VMs to set a memory limit to try and bring down the memory usage on each of the hosts. Am I along the right lines?
I have managed to get the memory usage down to 65% on each host, down from 78%!! We are currently experiencing a memory leak on ESXi, but I would also like to be able to bring one of the hosts down to test our solution running on 2 hosts. I have also reduced the number of vCPUs on a couple of servers.
Once I have figured out how to get things running on 2 hosts I will remove all of the limits. I think we need to bring another host online!
The odds are that you have over-subscribed CPU rather than memory.
CPU contention / scheduling is probably the candidate here.
Try scaling your VMs down to 1 or 2 vCPUs and see how they perform. If this works ok, then rethink your vCPU strategy - and possibly create reservations where applicable.
We are about to start scaling back the number of vCPUs in our VMs. We have a couple of VMs that we are going to drop from 4 to 2 vCPUs.
One further question about CPU scheduling. If a VM with 1 vCPU is running on a dual-socket Xeon system, will that 1 vCPU hold up the other cores? I mean, will the other VMs (the ones with 4 vCPUs - and the same if there are a couple of 2-vCPU VMs waiting to run) have to wait for all of their cores to be available at once? And will the VMkernel schedule 3 cores on one CPU and one on another?
Thanks for the advice
A core is treated as a vCPU. You absolutely need to move away from 4 vCPUs, and I think most would agree that you should start them all at 1 vCPU.
SMP VMs can have their vCPUs scheduled across physical CPUs, although the kernel will try to avoid doing so, as it has a very detrimental effect on performance.
As others have said, start with 1 vCPU and only increase if there is a need to do so.
Using Capacity Analyzer, we noted that we had relatively low CPU usage but high CPU Ready metrics. It was clearly displayed in the table.
Using the RightSizer module, we defined the parameters for how we wanted the VMs to perform given proper allocations.
RightSizer told us that several of our VMs were way over-allocated, and by removing extraneous vCPUs (as people have mentioned here), we were able to resolve a lot of the CPU Ready that was being experienced.
For the remaining VMs still experiencing CPU Ready even with a single vCPU, we were able (with Capacity Analyzer) to see and filter all the hosts and order them by CPU Ready. Then we vMotioned some of the VMs to do the load balancing we needed, since DRS doesn't seem to take CPU Ready into account.
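For anyone reading the raw vCenter performance charts rather than a third-party tool: the real-time CPU Ready counter is a summation in milliseconds per 20-second sample, and the usual conversion to a percentage is sketched below. The ~5% warning threshold is a common rule of thumb, not a VMware-mandated limit:

```python
# Convert a vCenter real-time "CPU Ready" summation value (milliseconds
# of ready time accrued in one sample) into a percentage of the sample.
# Real-time charts sample every 20 seconds; values much above ~5% per
# vCPU are generally worth investigating.

def ready_percent(ready_ms, interval_s=20):
    return ready_ms / (interval_s * 1000) * 100

print(ready_percent(2000))   # 2000 ms over a 20 s sample -> 10.0 (%)
```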
Each core will provide you with an additional vCPU - so you can have 4 different VMs each using 1 core (1 vCPU) from the same physical processor.
The problem with giving a VM 4 vCPUs when you have only 4 cores per socket is that the VM then needs all of those cores free at once - effectively exclusive access to that processor on the host.