g1xx3rb0y
Contributor

Weird ESX 3.5i VM unresponsive issue


Hi Everyone,

I have a very weird issue on two HP DL380 G5 boxes, each loaded with ESX 3.5i U3.

The setup:

DL380 G5

2x quad-core CPUs

2x 72 GB RAID 1+0

6x 146 GB RAID 5

32 GB RAM

I installed ESX 3.5i U3 as per normal and created VMs. But when I copy data from a network folder to a VM, the CPU spikes and the VM becomes unresponsive. What makes it worse is that if I copy a large file from C: to another folder on C:, the VM is unresponsive - CPU high and ping replies at about 200 ms. The OS VMDKs are all on the 72 GB RAID 1+0 datastore. When I add a data drive from the RAID 5 datastore, the same thing happens: if I copy from C: to that drive, the VM becomes unresponsive and pings hover at about 200 ms.

Any ideas?

I have no clue why this is happening. I've deployed the same setup elsewhere with a bunch of VMs and had no problems.

29 Replies
kjb007
Immortal

Without the cache working properly, your VMs have to wait until each I/O completes before moving on to the next instruction, forming a queue. With a large transfer, where lots of data is going through, it makes sense that your CPU usage keeps climbing as it tries to handle all of the I/O that would typically be handed off to a cache.

-KjB

VMware vExpert 2009

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
g1xx3rb0y
Contributor

So basically the black battery attachment is the BBWC (battery-backed write cache) - which does all the work the CPUs were doing before? We received the servers prebuilt and didn't know it doesn't come standard, to be honest! I'm going to monitor and will let you guys know!!

kjb007
Immortal

Sort of. The CPU sends the I/O to the cache; the cache responds that the I/O has been committed, so processing continues. The cache keeps the data in memory and commits it to disk as quickly as it can. Since the cache is memory, and memory is faster than disk, the CPU receives a response immediately and can move on to the next transaction. Without the cache, the CPU has to wait until that I/O has actually been written to disk. If I/O keeps arriving faster than it is being written to disk, a queue/backlog forms and work piles up. You can see how, over time, a lot of data can stress a CPU this way.
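For anyone curious, the queueing effect kjb007 describes can be sketched as a toy model (entirely my own illustration, not anything from ESX or the controller firmware): if requests arrive faster than the disk can commit them, the per-I/O wait keeps growing; with a write-back cache acking from battery-backed RAM, the CPU never stalls.

```python
# Toy model of write-through vs write-back behavior. The timings are
# made-up numbers, purely to show how a backlog builds without a cache.

def waits(n_ios, arrival_gap_ms, disk_write_ms, cached):
    """Return the wait (ms) seen by each I/O request.

    cached=True  -> the controller acks from battery-backed RAM
                    immediately (write-back), so the CPU never waits.
    cached=False -> each I/O must reach the disk (write-through);
                    if requests arrive faster than the disk drains
                    them, a backlog builds and waits keep growing.
    """
    disk_free_at = 0.0
    out = []
    for i in range(n_ios):
        t = i * arrival_gap_ms          # when this request arrives
        if cached:
            out.append(0.0)             # ack from cache: no stall
        else:
            start = max(t, disk_free_at)
            disk_free_at = start + disk_write_ms
            out.append(disk_free_at - t)  # time until committed
    return out

# Requests every 1 ms, but the disk needs 3 ms per write:
uncached = waits(10, 1, 3, cached=False)
cached = waits(10, 1, 3, cached=True)
print(uncached)  # waits grow: 3.0, 5.0, 7.0, ... the queue piles up
print(cached)    # all zeros: the cache absorbs the burst
```

The growing wait in the uncached case is exactly the "work piles up" behavior described above, and it shows up to the guest as an unresponsive VM.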

-KjB

VMware vExpert 2009

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
g1xx3rb0y
Contributor

Thanks for the explanation kjb.

In any case... thanks to all the other people who assisted and pointed me in the right direction. I've been copying like crazy and everything seems fine again!!

Cheers

CWedge
Enthusiast

I'll put money on this.

Disable all your USB controllers (1.0 and 2.0) in the BIOS, reboot, and retry.

This is a known dirty secret that no one wants to acknowledge.

Due to IRQ conflicts, the VMkernel (which only runs on CPU0) ends up sharing the same IRQ as, say, the NIC or RAID card, and holds it to CPU0. All future interrupt handling for those cards is then pinned to CPU0 instead of being spread across all available cores.

That produces all the symptoms you describe, especially the machines hanging.

After removing USB, we saw local disk file copies go from 70 MB/s to 444 MB/s.

Network speeds went from under 30 MB/s (240 Mb/s) to the full line speed of a gigabit connection.
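For anyone wanting to check whether they have this IRQ-sharing situation before touching the BIOS: on a Linux-style system, devices sharing an interrupt line show up as a comma-separated list on one row of /proc/interrupts. Here is a rough sketch of spotting that; the sample listing is made up for illustration (device names like uhci_hcd:usb1, bnx2, and cciss0 are just plausible examples), and whether this exact file is reachable on a given ESX 3.5i build is something you'd have to verify yourself.

```python
# Parse a /proc/interrupts-style listing and flag IRQ lines claimed by
# more than one device. SAMPLE is a fabricated capture for illustration.

SAMPLE = """\
 16:  1234567  IO-APIC-level  uhci_hcd:usb1, bnx2
 17:    54321  IO-APIC-level  cciss0
 18:      999  IO-APIC-level  uhci_hcd:usb2
"""

def shared_irqs(text):
    """Return {irq: [devices]} for interrupt lines with 2+ devices."""
    shared = {}
    for line in text.splitlines():
        left, sep, devs = line.partition("IO-APIC-level")
        if not sep:
            continue  # skip lines that aren't IO-APIC interrupt rows
        names = [d.strip() for d in devs.split(",") if d.strip()]
        if len(names) > 1:
            irq = int(left.split(":")[0])
            shared[irq] = names
    return shared

print(shared_irqs(SAMPLE))  # IRQ 16: a USB controller sharing with the NIC
```

In the made-up sample, IRQ 16 is shared between a USB controller and the NIC, which is the kind of pairing this thread suggests disabling USB in the BIOS to break up.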

g1xx3rb0y
Contributor

Interesting... we haven't had any copy problems since the controller changes, but I'll give it a shot and report back. Any performance gain is always welcome!!! ;)

CWedge
Enthusiast

Did you try that?

--Craig

g1xx3rb0y
Contributor

No... unfortunately not. The servers are already in production - running like a dream. At the next patch weekend I'll make the changes and then update this post - should be next weekend or the one after.

Thanks again Craig !

Cheers

g1xx3rb0y
Contributor

Hi Guys,

I disabled the USB controllers, and I can honestly say that both local-disk and network copy speeds increased. I have since included this in my ESX setup doc and am making the change at all my clients.

Thanks for this excellent tip!!!

CWedge
Enthusiast

Glad to see it worked.

HP and VMware will not admit to this flaw, but it is real.

I went from 70 MB/s on a local 16-drive SAS array to 333 MB/s,

and on my 1 GbE link I couldn't get over 280 Mb/s; now it saturates the gigabit at around 900 Mb/s.

VMotions now take seconds instead of minutes...

It's amazing... like an over 5x total system performance gain.

If you had 10 VMs per host before, you could probably host 30 VMs now...
