Re: Weird performance problem. Memory contention issue?

wingphil · ‎12-02-2009

Hi there,

We have two brand new Dell PowerEdge R610s that came with ESXi 3.5 embedded, and I'm getting some weird guest OS performance problems. At first I thought it was our software, but then I started using prime95 to stress test the VMs and the results are confusing.

First a bit of background: our software is Java-based and uses up to half a gig of RAM. It also maxes out the CPU. The guest OSes are Win XP 32-bit, fully updated. ESXi is fully patched and VMware Tools is up to date on each guest. Each VM has 1 GB of RAM, which I've reserved for it in the host resource allocation. I've also limited each VM to 900 MHz of CPU. The server has 8 cores at 2 GHz (HT turned off) and 16 GB of RAM.

If you select the option in prime95 to test only the CPU, each VM maxes out at the 900 MHz limit. If you set it to use half a gig of RAM, thereby mimicking our Java software, the VMs drop down to a few MHz of CPU usage and all the VMs become unresponsive, even the ones not currently running prime95.

I've got 9 of these VMs running, but the problem becomes apparent when only 2 or 3 of them are actually running prime95. At 900 MHz and 1 GB of RAM each, the server should be able to run 15 of them with prime95 running, even allowing for host overhead, right?
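As a rough sanity check of that estimate, here is a minimal Python sketch using only the figures quoted above. The names are illustrative, and it deliberately ignores hypervisor and per-VM overhead, so the real headroom would be somewhat smaller.

```python
# Figures quoted in the post; constant and function names are illustrative only.
CORES = 8
CORE_MHZ = 2000
HOST_RAM_GB = 16
VM_CPU_LIMIT_MHZ = 900
VM_RAM_GB = 1


def headroom(vm_count):
    """Return (cpu_mhz_left, ram_gb_left) with vm_count VMs at their CPU limit.

    Naive arithmetic: no allowance for hypervisor or per-VM overhead.
    """
    cpu_left = CORES * CORE_MHZ - vm_count * VM_CPU_LIMIT_MHZ
    ram_left = HOST_RAM_GB - vm_count * VM_RAM_GB
    return cpu_left, ram_left


if __name__ == "__main__":
    for n in (3, 9, 15):
        cpu_left, ram_left = headroom(n)
        print(f"{n} VMs: {cpu_left} MHz CPU and {ram_left} GB RAM to spare")
```

On paper, 15 VMs leave 2500 MHz and 1 GB spare, so CPU or memory capacity alone should not explain a stall with only 2 or 3 busy VMs.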

I'd be very grateful if anyone could help me with my configuration. There's probably something I've not set up right, but my feeling is that ESXi should be able to run at least a few such VMs without contention out of the box.

Any help much appreciated,

Phil

jfelinski · ‎12-02-2009

By saying "If you set it to use half a gig of ram", do you mean setting a 500 MB memory limit on the VM? If so, you're actually triggering the reclamation process (the guest OS is not aware of the memory limit and tries to use the memory). The host will first start to balloon the memory and later resort to host swapping, which will dramatically affect performance.

Check the memory/balloon and memory/swap performance counters.

---

MCSA+S, VCP 3, VCP 4, vExpert

http://wirtualizacja.wordpress.com

wingphil · ‎12-02-2009

Hi, and thanks for your answer.

I'm not changing the VM RAM settings: the guest sees 1 GB and I've reserved 1 GB in the host resource allocation, with no upper limit.

What I meant was that prime95 allows you to select how much RAM you want it to use for its stress test. If you use none (or very little) there's no performance problem, but if you use 512 MB everything slows to a crawl. I can't see why this would be, as there shouldn't be any memory shortage for the guest or host.

Memory/balloon and memory/swap counters show zero across the board.

jfelinski · ‎12-02-2009

Check the CPU utilization per core in the Performance tab for the host. Maybe you have affinity settings and all the VMs are running on one core? It would also be interesting to see the CPU/memory graphs per VM.

wingphil · ‎12-02-2009

OK, right now I have 4 VMs running prime95, everything else is powered down. As each VM is limited to 900 MHz of CPU, the total usage should be around 3600 MHz, or around 22 percent (16 GHz available on the host). As you can see from the attached file it's averaging less than half that. All the cores appear to be in use, and I definitely have no affinity settings on the VMs.
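For reference, the expected figure works out as below. This is only a sketch of the arithmetic in this post; the function names and default values are illustrative, taken from the numbers quoted in the thread.

```python
def expected_host_pct(running_vms, limit_mhz=900, cores=8, core_mhz=2000):
    """Expected host CPU usage (percent) if every running VM hits its limit."""
    return 100.0 * running_vms * limit_mhz / (cores * core_mhz)


def limit_as_core_pct(limit_mhz=900, core_mhz=2000):
    """The per-VM CPU limit expressed as a percentage of a single core."""
    return 100.0 * limit_mhz / core_mhz


if __name__ == "__main__":
    print(expected_host_pct(4))   # four busy VMs, as in this test
    print(limit_as_core_pct())    # one VM's limit relative to one 2 GHz core
```

Four VMs at their limit come to 22.5 percent of the host, and one VM's 900 MHz limit is 45 percent of a single core.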

Also attached is the CPU and memory graph for one of the VMs. Thanks very much for your help!

jfelinski · ‎12-02-2009

If you have access to rCLI or VIMA, could you please post the output of resxtop taken during this global slowness?

This link should give you a better understanding of how to read the esxtop counters: http://communities.vmware.com/docs/DOC-3930

It might also help to check disk latencies, as the swap file may be in use during the prime95 tests. Look at the following counters: DISK -> write latency, read latency, queue command latency.

wingphil · ‎12-02-2009

Here you go, the disk latency appears not to be the issue. The prime95 test doesn't stress the disk AFAIK.

jfelinski · ‎12-03-2009

Strange, it all looks good. Was it taken during the global slowdown of all the VMs? What is the memory usage in resxtop during this behaviour?

wingphil · ‎12-03-2009

Yes, it was. It's not actually a very typical snapshot, in that all the VMs are managing to get some work done, but you can see that three of them are running at about 15% of one core, when it should be around 900 MHz, or 45 percent of one core. Here's another one, and one of the memory.

jfelinski · ‎12-03-2009

Honestly, it all looks alright. The only other thing I can think of is the timer-interrupt rate, as Java apps like to raise the default VM timer to some strange values. In resxtop, in the CPU stats, add the extra Summary Stats (press f) and check the TIMER/S value; it shouldn't be higher than 1000. It would also be interesting to see the guest CPU utilization during these periods.

Running out of options

wingphil · ‎12-03-2009

TIMER/s just shows blank with no values for all processes.

Task manager within the guest shows 95-99% usage for the prime95 task, even while the %USED of that vm in esxtop is less than 1%.

Could it be to do with the way I have created the VMs? I built one template VM from scratch with a Windows XP CD, then cloned it. There didn't seem to be a clone feature in ESXi, so I did it manually by cloning the virtual disk with vmkfstools, copying the vmx file and then editing values.

I edited the following values: "displayName" and "scsi0:0.fileName". I deleted the following so they would be recreated: "ethernet0.generatedAddress", "ethernet0.generatedAddressOffset", "uuid.bios", "uuid.location", "sched.swap.derivedName".

Is this a valid way of creating clones from a template?
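For anyone wanting to repeat those manual edits, they could be scripted roughly as follows. This is a sketch only: it assumes the VMX file is plain `key = "value"` lines, the function name and the example clone names are made up, and it simply mirrors the steps listed above without claiming they are a supported cloning procedure.

```python
# Keys deleted so that the host regenerates them (the list from the post above).
DROP_KEYS = {
    "ethernet0.generatedAddress",
    "ethernet0.generatedAddressOffset",
    "uuid.bios",
    "uuid.location",
    "sched.swap.derivedName",
}


def prepare_clone_vmx(text, display_name, disk_file):
    """Return VMX text for a clone: identity keys dropped, name and disk replaced.

    Hypothetical helper; every other line is passed through unchanged.
    """
    overrides = {"displayName": display_name, "scsi0:0.fileName": disk_file}
    out = []
    for line in text.splitlines():
        key = line.split("=", 1)[0].strip()
        if key in DROP_KEYS:
            continue  # leave it out so it gets recreated
        if key in overrides:
            line = f'{key} = "{overrides[key]}"'
        out.append(line)
    return "\n".join(out) + "\n"
```

Typical use would be to read the template's VMX text, call `prepare_clone_vmx(text, "clone2", "clone2.vmdk")` (names hypothetical), and write the result next to the cloned disk before registering the VM.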

jfelinski · ‎12-03-2009

I think the only thing left is to export the logs from the ESXi host and go through the VM and host logs; maybe you'll find something suspicious.

file->export->export diagnostics data

Good luck

wingphil · ‎12-09-2009

Well, thanks for all your help anyway. We eventually figured out the problem was that the template VMs had originally been converted from VMware Workstation.

My colleague tried creating a fresh VM from scratch directly on the ESXi host, and then exported and reimported it a few times, and the resulting VMs exhibited none of the performance problems that my ones did. So I downloaded the VMX files from the host datastore, diffed them, and merged over from his VMX file into mine anything that wasn't specific to the filename/location/uuid/MAC address of the VM. Then I unregistered mine, uploaded the modified VMX, and re-registered. And that did the trick.

There were a lot of differences and I have no idea which one was causing the problem, but if anyone has a similar problem, try the above and see if it helps.
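In case it helps, the diff-and-merge step could be sketched in Python along these lines. It is not the exact procedure I followed by hand: it assumes VMX files are plain `key = "value"` lines, the function names are made up, and the set of "identity" keys is my guess based on the keys mentioned earlier in the thread, so adjust it for your own VMs.

```python
# Assumed VM-specific keys that must NOT be copied from the reference VMX.
IDENTITY_KEYS = {
    "displayName",
    "scsi0:0.fileName",
    "ethernet0.generatedAddress",
    "ethernet0.generatedAddressOffset",
    "uuid.bios",
    "uuid.location",
    "sched.swap.derivedName",
}


def parse_vmx(text):
    """Parse VMX text into a dict, skipping blank, comment and malformed lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings


def merge_vmx(mine, reference, keep=IDENTITY_KEYS):
    """Copy every non-identity setting from the reference VMX over mine."""
    merged = dict(mine)
    for key, value in reference.items():
        if key not in keep:
            merged[key] = value
    return merged


def only_in_mine(mine, reference, keep=IDENTITY_KEYS):
    """Non-identity keys present only in my VMX; worth reviewing by hand."""
    return sorted(k for k in mine if k not in reference and k not in keep)
```

Keys reported by `only_in_mine` survive the merge untouched, so if the slowdown persists they are the next place to look.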

Cheers,

Phil
