RParker
Immortal

Performance meters NOT accurate / CPU overhead

I have been tweaking and testing many different tools.

The latest is Vizioncore newest version of vFoglight. (no this isn't a vFoglight endorsement).

What I found is that vCenter shows TOTALLY bogus info. I thought at first the numbers in vFoglight were wrong, because surely VMware CAN'T make bad tools, can they? I mean, it's THEIR product.

Well, doing various calculations, such as a 36 GB file copied in 9 minutes, vCenter says it's 15 MB/s. Umm... WRONG!
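
For reference, a quick sanity check of that arithmetic (a minimal sketch in Python, using binary units):

    # Effective rate of a 36 GB copy that takes 9 minutes
    size_mb = 36 * 1024          # 36 GB expressed in MB
    seconds = 9 * 60
    print(size_mb / seconds)     # ~68 MB/s, nowhere near 15 MB/s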

Well, that's OK; thank goodness for vKernel / Vizioncore for proving vCenter is vastly inept for performance-reporting purposes.

Also, if you turn OFF performance statistics, vCenter runs better. At least that's my impression, so apparently collecting those statistics takes some cycles.

Bottom line: if you want TRUE performance reports for your vCenter / ESX environment, use 3rd-party tools, AND you can reduce overhead on vCenter by turning OFF those inaccurate performance statistics anyway.
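
If you want to turn the collection down programmatically rather than through the client, the vSphere API exposes the statistics intervals. A minimal sketch with pyVmomi; the host name and credentials are placeholders, and this assumes PerformanceManager.UpdatePerfInterval behaves as documented:

    # Sketch: lower vCenter's historical statistics collection level via pyVmomi
    from pyVim.connect import SmartConnect, Disconnect
    import ssl

    ctx = ssl._create_unverified_context()      # lab use only; skips cert checks
    si = SmartConnect(host="vc.example.com", user="admin",
                      pwd="secret", sslContext=ctx)
    try:
        perf = si.RetrieveContent().perfManager
        for interval in perf.historicalInterval:
            print(interval.key, interval.name, interval.enabled, interval.level)
            interval.level = 1                  # minimum collection level
            perf.UpdatePerfInterval(interval)   # vCenter may enforce ordering rules
    finally:
        Disconnect(si)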

4 Replies
jhanekom
Virtuoso

OK... I'll bite.

Out of curiosity:

  • can you please describe the nature of the file you were copying?

  • which metric was reporting 15MB/s?

  • what was vFoglight saying for the same period, same metric?

  • was vFoglight monitoring concurrently, or a separate iteration of the test?

  • if available, what was esxtop saying for that metric?

vCenter does have its flaws, but when they're reported they're usually corrected. Case in point: the reversal of the network send/receive counters in vCenter 4, which has since been fixed.

RParker
Immortal

can you please describe the nature of the file you were copying?

A VM, migrated from one ESX host to another. ALL 1Gb NICs, ALL on the same segment, ALL the same ESX build.

which metric was reporting 15MB/s?

Network utilization on the source and destination hosts. The calculation is flat-out wrong.

what was vFoglight saying for the same period, same metric?

I have long suspected vCenter was inaccurate, because the math never works out. If a file takes 9 minutes to copy and it's 36GB, how can that be a 15 MB/s speed? That works out to roughly 68 MB/s, so the reported figure is not even close.

Same with the disk usage on Vizioncore backups: I can do a backup in a certain amount of time, but if I watch the ESX host where the backups are taking place and the ESX host where the VM resides, they show a different value from what I actually calculate. Besides which, two different appliances were reporting the same thing (not to mention the value I get with a simple calculator), and ALL of them show vCenter is wrong. Maybe it's right for SOME things, but if it's not right for EVERY performance meter, then that's suspect for EVERYTHING, since it's not reliable.

was vFoglight monitoring concurrently, or a separate iteration of the test?

Same time.

if available, what was esxtop saying for that metric?

That's the other thing: in vCenter, watch one metric for the NIC, say, and disk transfer as the other metric. How can a network transfer show 40 MB/s while disk shows something that doesn't make any sense? I am aware of caching, but to be off by 20 or 30 MB/s? Cache can account for SOME of that, but not that much. Also, esxtop (which is more accurate and reliable) was showing different values that were more reasonable than vCenter's.

This is a NEW vCenter install, as recent as maybe 3 or 4 months ago. I have noticed this with previous vCenter installs as well, but I never really investigated until recently, when I was trying to isolate another issue with vCenter and started coming across this information.

On a side note, I disabled performance stats, and now vCenter runs better (maybe it's my imagination). But I would rather have the performance of vCenter and let a 3rd party do the metrics; even if all things were equal, letting a 3rd-party tool gather metrics instead of vCenter makes more sense, especially when there is an impact on performance.
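
If you do end up pulling metrics yourself, the same counters the 3rd-party tools read are available straight from the vSphere API. A minimal sketch with pyVmomi; again the host and credentials are placeholders, and the VM name is borrowed from the test examples later in this thread:

    # Sketch: query a real-time (20 s) performance counter for one VM
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim
    import ssl

    ctx = ssl._create_unverified_context()      # lab use only; skips cert checks
    si = SmartConnect(host="vc.example.com", user="admin",
                      pwd="secret", sslContext=ctx)
    try:
        content = si.RetrieveContent()
        perf = content.perfManager

        # Map "group.name.rollup" counter names to their numeric IDs
        counters = {"%s.%s.%s" % (c.groupInfo.key, c.nameInfo.key, c.rollupType): c.key
                    for c in perf.perfCounter}
        net_usage = counters["net.usage.average"]   # KBps

        # Find the VM by inventory name
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.VirtualMachine], True)
        vm = next(v for v in view.view if v.name == "migratetest1")

        spec = vim.PerformanceManager.QuerySpec(
            entity=vm,
            metricId=[vim.PerformanceManager.MetricId(counterId=net_usage,
                                                      instance="")],
            intervalId=20,                          # real-time stats: 20 s samples
            maxSample=15)                           # roughly the last 5 minutes
        for result in perf.QueryPerf(querySpec=[spec]):
            for series in result.value:
                print(series.value)                 # KBps per 20 s sample
    finally:
        Disconnect(si)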

And another side, side note... I just realized that the vCenter status is hardly ever error free (and there are multiple posts on this). There are always LDAP, rollup-stats, or health-status error messages. It bugs me that even on a new install, and after reinstalls and following the instructions on how to fix it, there are still problems around the performance stats. To me that shows there are glaring flaws, and this is just one incident.

ThompsG
Virtuoso

Just thinking outside the box, but could it be because the "realtime" stuff you see in vCenter is actually averaged over a 20-second period? Ready time in vCenter is a classic example of this that often trips people up.
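
To illustrate how a 20-second average can mask bursts (a purely hypothetical traffic pattern, not measured data):

    # Hypothetical: a link bursting at ~125 MB/s (~1 Gb/s) for 5 s,
    # then idle for 15 s, averaged over a 20 s real-time sample
    burst_rate = 125                    # MB/s while busy
    busy, window = 5, 20                # seconds
    print(burst_rate * busy / window)   # reported average: ~31 MB/s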

Kind regards.

jhanekom
Virtuoso

To keep things focused, I'm going to stick to the core of the original issue: incorrect statistics while doing a cold VM migrate between two hosts.

I've conducted some tests, and my conclusion is that what you are seeing is not due to reporting anomalies (in vCenter, vFoglight, or esxtop) but purely to how ESX functions. Details follow. (Feel free to conduct similar tests; maybe there's something in your setup that causes different results.)

Also, I’m still not 100% sure on some of your answers in the previous post:

  • vFoglight was monitoring at the same time as vCenter/esxtop. What was vFoglight reporting in comparison to vCenter?

  • I'm not clear on your answer about what esxtop was saying at the time in comparison to what vCenter was saying. Can you please test again and report findings?

To investigate, I did the following:

1. Created a VM ("migratetest1") on host1 on local storage

  • 10GB disk (zeroedthick, no data, just the blank disk)

2. Created a VM ("migratetest2") on host1 on local storage

  • 10GB disk (eagerzeroedthick, no data, just the blank disk)

3. Created a VM ("migratetest3") on host1 on local storage

  • 10GB disk (zeroedthick, filled with random data generated on another VM)

4. Created a VM ("migratetest4") on host1 on local storage

  • 10GB disk (eagerzeroedthick, filled with random data generated on another VM)

5. Opened esxtop on both hosts and switched to networking

6. Test 1: migrate "migratetest1" from host1 to host2

7. Test 2: migrate "migratetest2" from host1 to host2

8. Test 3: migrate "migratetest3" from host1 to host2

9. Test 4: migrate "migratetest4" from host1 to host2

Notes

  • The esxtop values, and to a lesser extent the vCenter values, are tough to average, but the figures below are fairly representative
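
For reference, the approximate MB/s figures quoted with each migration time below are just disk size divided by elapsed seconds:

    # Effective migration rate for a 10 GB disk over the elapsed time
    size_mb = 10 * 1024
    for label, secs in (("Test 1", 65), ("Test 2", 55),
                        ("Test 3", 538), ("Test 4", 275)):
        print(label, round(size_mb / secs), "MB/s")   # ~158, ~186, ~19, ~37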

Results: Test 1

  • Time to migrate: 65s (~160MB/s)

  • esxtop network transfer rate: 0MB/s

  • esxtop disk transfer rate: 0MB/s (brief spikes @ 5MB/s)

  • vCenter network transfer rate: 0MB/s

  • vCenter disk transfer rate: near-0MB/s

Results: Test 2

  • Time to migrate: 55s (~186MB/s)

  • esxtop network transfer rate: 0MB/s

  • esxtop disk transfer rate: ~44MB/s read host1 / ~41MB/s written host2

  • vCenter network transfer rate: 0MB/s

  • vCenter disk transfer rate: ~44MB/s read host1 / ~41MB/s written host2

Results: Test 3 <-- this is an interesting one

  • Time to migrate: 538s (~19MB/s)

  • esxtop network transfer rate: ~160Mb/s transmit host1 / ~160Mb/s receive host2

  • esxtop disk transfer rate: ~20MB/s read host1 / ~40MB/s written host2

  • vCenter network transfer rate: ~20MB/s transmit host1 / ~20MB/s receive host2

  • vCenter disk transfer rate: ~20MB/s read host1 / ~40MB/s written host2

Results: Test 4

  • Time to migrate: 275s (~37MB/s)

  • esxtop network transfer rate: ~340Mb/s transmit host1 / ~340Mb/s receive host2

  • esxtop disk transfer rate: ~42MB/s read host1 / ~42MB/s written host2

  • vCenter network transfer rate: ~42MB/s transmit host1 / ~42MB/s receive host2

  • vCenter disk transfer rate: ~42MB/s read host1 / ~42MB/s written host2
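
Note the units when comparing the two tools above: esxtop reports network rates in megabits per second (Mb/s), while the vCenter figures are in megabytes per second (MB/s). Dividing by 8 reconciles them:

    # esxtop network counters are megabits/s; vCenter's figures are megabytes/s
    for esxtop_mbits in (160, 340):     # Test 3 and Test 4 readings
        print(esxtop_mbits / 8)         # 20.0 and 42.5 -> matches vCenter's MB/s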

Observations:

  • Test 1: Under certain circumstances (notably "lazy" zeroedthick), ESX knows which blocks are zero and won't read them from disk

  • Test 2: Cold migration isn't completely dumb, and does at least some form of run-length encoding, even if the blocks are read from disk

  • Test 3: With "lazy" zeroedthick disks, it seems that the destination write rate is twice the source. This is likely due to the destination blocks being zeroed first

Conclusions:

  • vCenter stats are fine: they match what is seen in esxtop and also what I would expect.

  • These tests confirm what is reasonably well known about behaviour around uninitialised blocks

  • I did not expect that cold migrate would exhibit intelligence in transferring data over the network by doing zero-compression (either that or full RLE; I haven't confirmed which)

  • I did not expect that "lazy" zeroedthick blocks would be zeroed every time they're moved, even if they've already got data in them; after all, if it knows which blocks are zero when reading data from disk, surely it should know that it doesn't need to zero them at the destination...

  • The "anomalies" you saw in your cold migration are likely due to the 36GB OS disk being populated by only a small amount of data; the disk is likely zeroedthick (not eagerzeroedthick) and has not been written to extensively; my guess is an OS instance of about 8GB, which would tie up closely with your migration times
