I have a very weird issue on 2x HP DL380 G5 boxes, each running ESX 3.5i U3.
2x quad-core CPUs
2x 72 GB RAID 1+0
6x 146 GB RAID 5
32 GB RAM
I installed ESX 3.5i U3 as per normal and created VMs. But when I copy data from a network folder to a VM, the CPU spikes and the VM becomes unresponsive. What makes it worse is that if I copy a large file from C: to another folder on C:, the VM is unresponsive, with high CPU and ping replies at about 200 ms. The OS VMDKs are all on the 72 GB RAID 1+0 datastore. When I add a data drive from the RAID 5 datastore, the same thing happens: if I copy from C: to that drive on the RAID 5 datastore, the VM becomes unresponsive and pings hover at about 200 ms.
I have no clue why this is happening; I've deployed the same setup elsewhere with a bunch of VMs and had no problems.
Without the cache working properly, your VMs have to wait until each I/O completes before moving on to the next instruction, forming a queue. On a large transfer, where lots of data is going through, it makes sense that CPU usage keeps climbing as it tries to handle all of the I/O that would typically be handed off to a cache.
VMware vExpert 2009
So basically the black battery attachment is the BBWC (Battery-Backed Write Cache), which does all the work the CPUs were doing before? We received the servers prebuilt and didn't know it doesn't come standard, to be honest! I'm going to monitor and will let you guys know!!
Sort of. The CPU sends the I/O to the cache; the cache responds that the I/O has been committed, so processing continues. The cache keeps the data in memory and commits it to disk as quickly as it can. Since the cache is memory, which is faster than disk, the CPU receives a response immediately and can move on to the next transaction. Without the cache, the CPU has to wait until that I/O has actually been written to disk. If I/O keeps arriving faster than it's being written out, a queue/backlog forms and work piles up. You can see how, over time and with a lot of data, this can stress a CPU.
VMware vExpert 2009
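To make the explanation above concrete, here's a toy Python model of the queueing effect; the latency numbers are made up for illustration, not measured from any HP controller. With a fast cache acknowledgment, each I/O is acked before the next one arrives and no backlog forms; with a slow disk commit, the backlog grows without bound.

```python
# Toy model of write-back caching vs. waiting on the disk.
# All timings are assumed/illustrative, not real hardware figures.
DISK_WRITE_MS = 8.0   # assumed time to commit one I/O to disk
CACHE_ACK_MS = 0.05   # assumed time for a BBWC to acknowledge
ARRIVAL_MS = 1.0      # a new I/O arrives every millisecond

def backlog_after(n_ios, ack_ms):
    """Milliseconds of work left queued after n_ios arrivals,
    given the per-I/O acknowledgment latency."""
    queued = 0.0
    for _ in range(n_ios):
        # each arrival adds ack_ms of work; ARRIVAL_MS drains per tick
        queued = max(0.0, queued + ack_ms - ARRIVAL_MS)
    return queued

print(backlog_after(1000, CACHE_ACK_MS))   # → 0.0 (cache keeps up)
print(backlog_after(1000, DISK_WRITE_MS))  # → 7000.0 (backlog piles up)
```

The point of the sketch: the CPU isn't doing more useful work without the cache, it's just stuck servicing an ever-growing queue.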
Thanks for the explanation kjb.
In any case... thanks to all the other people who assisted and pointed me in the right direction. I've been copying like crazy and everything seems fine now again!!
I'll put money on this.
Disable all your USB controllers (1.0 and 2.0) in the BIOS, reboot, and retry.
This is a known dirty secret that no one wants to acknowledge.
Due to IRQ conflicts, the VMkernel (which only runs on CPU0) ends up sharing the same IRQ as, say, the NIC or RAID card, and pins it to CPU0. All further interrupt handling for those cards is then held to CPU0 instead of being spread across all available cores.
You experience all the symptoms you've described, especially seeing the machines hang.
After disabling USB, we saw local disk file copies go from 70 MB/s to 444 MB/s.
Network speeds went from under 30 MB/s (240 Mb/s) to the full line speed of a gigabit connection.
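For anyone who wants to verify the IRQ-sharing claim on a regular Linux box (ESX 3.5i's own console is more limited, so treat this as an illustrative diagnostic, not an ESX procedure), a shared line shows up in /proc/interrupts as multiple drivers listed on one IRQ. This little Python sketch parses that format; the sample text and device names are made up:

```python
# Flag IRQ lines shared by more than one driver, /proc/interrupts style.
# SAMPLE is fabricated for illustration; on a real Linux host you could
# read open("/proc/interrupts").read() instead.
SAMPLE = """\
           CPU0       CPU1
  0:    1234567          0   IO-APIC-edge      timer
 16:      89012          0   IO-APIC-fasteoi   uhci_hcd:usb1, eth0
 17:      45678          0   IO-APIC-fasteoi   cciss0
"""

def shared_irqs(text):
    """Return IRQ numbers whose device list names more than one driver
    (drivers on a shared line are comma-separated in /proc/interrupts)."""
    shared = []
    for line in text.splitlines():
        irq, sep, rest = line.partition(":")
        if sep and irq.strip().isdigit() and "," in rest:
            shared.append(int(irq))
    return shared

print(shared_irqs(SAMPLE))  # → [16]: the NIC shares a line with a USB controller
```

In the sample, the NIC (eth0) sharing IRQ 16 with a USB controller is exactly the kind of conflict the post describes; disabling USB in the BIOS frees that line.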
Interesting... we haven't had any copy problems since the controller changes, but I'll give it a shot and provide feedback. Any performance gain is always welcome!!! ;)
No... unfortunately not; the servers are already in production, running like a dream. At the next patch weekend I'll make the changes and then update this post. It should be next weekend or the one after.
Thanks again Craig !
I disabled the USB controllers, and I can honestly say that copy speeds increased for both local disk and network copies. I've since included this in my ESX setup doc and am making the change at all my clients.
Thanks for this excellent tip!!!
Glad to see it worked.
HP and VMware will not admit to this flaw. But it is real.
I went from 70 MB/s on a local 16-drive SAS array to 333 MB/s,
and on my 1 GbE link I couldn't get over 280 Mb/s before; now it saturates the gigabit at around 900 Mb/s.
VMotions now take seconds instead of minutes...
It's amazing... over a 5x total system performance gain.
If you had 10 VMs per host before, you could probably host 30 VMs now...