After going through a series of IOmeter tests at a customer site, I am left scratching my head. The customer originally contacted us for help figuring out why their storage system was delivering such bad performance, but we quickly determined that the problem was somewhere higher up in the stack.
In our tests we saw that cleanly installed Windows guests delivered the expected performance: up to 400 MB/s for sequential loads, and 10,000+ IOPS for random tests.
However, when running the same benchmark inside an existing Windows VM (we tried both Windows 2003 and Windows XP guests) we saw just awful results: 60-70 MB/s sequential, and around 300 IOPS random. During these tests the latencies reported from inside the guest reached into the hundreds of milliseconds, even for the sequential load. However, esxtop on the ESX host running these VMs reported response times in the single-digit millisecond range, so clearly the problem was not that we were saturating the infrastructure in any way.
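For what it's worth, that kind of guest-vs-esxtop mismatch is consistent with IOs queuing up inside the guest before they ever reach the virtual SCSI layer. By Little's Law, the latency the issuer observes is roughly outstanding IOs divided by achieved IOPS. A quick sanity check (the numbers below are illustrative, not measurements from our runs):

```python
# Little's Law sanity check: latency ≈ outstanding IOs / throughput.
# Numbers are illustrative only, not from the actual test runs.

def observed_latency_ms(outstanding_ios, iops):
    """Average per-IO latency in ms as seen by whoever issues the IOs."""
    return outstanding_ios / iops * 1000.0

# What the device/fabric sees (the esxtop view): shallow queue, fast service.
device_lat = observed_latency_ms(outstanding_ios=2, iops=2000)   # 1.0 ms

# What the guest application sees if IOs pile up in an in-guest queue
# before reaching the virtual SCSI controller.
guest_lat = observed_latency_ms(outstanding_ios=64, iops=300)    # ~213 ms

print(f"device-side latency: {device_lat:.1f} ms")
print(f"guest-side latency:  {guest_lat:.1f} ms")
```

So single-digit latencies in esxtop and triple-digit latencies in the guest can both be "true" at the same time; the extra time is spent queued above the hypervisor.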
Since we figured this could be a driver and/or OS setting issue, we then changed the SCSI controller type of the OS disk we were testing against from lsilogic to buslogic. Lo and behold, we then saw the same excellent speeds as with our clean VMs.
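For anyone wanting to reproduce the swap: with the VM powered off, the controller type can be changed via the VI client's edit-settings dialog, or directly in the VM's .vmx file. A minimal sketch of the relevant line:

```
# In the VM's .vmx file, with the VM powered off:
scsi0.virtualDev = "buslogic"
# (was: scsi0.virtualDev = "lsilogic")
```

Note that the guest OS must already have a driver for the new controller type or it will fail to boot from that disk; in our case the existing Windows images evidently did.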
So my first question is this: does anyone know of a plausible reason why an old Windows install has such huge problems generating disk IO, when simply replacing the controller type fixes it? Going from lsilogic to buslogic (or vice versa) should have only a negligible impact, not the 10-15x improvement we are seeing.
Now for the second part of the issue: the customer also runs a large number of Linux guests (CentOS/RedHat 5.x), and we saw the same results when testing these. Unfortunately we were not able to demonstrate any "good" test runs; even a cleanly installed CentOS 5.3 had more or less the same performance as their existing VMs, and it was frankly lousy. We also tried a completely different Linux distribution, just to eliminate that particular kernel/library combination as the culprit, but with no change.
When testing the Linux guests, as with the old Windows guests, we saw a huge discrepancy between the latencies reported inside the guest and what esxtop was telling us. This is clearly relevant, and I am hoping someone has an explanation and a possible fix.
As a final FYI: we did try the divider=10 kernel boot parameter to reduce the timer interrupt rate of the Linux VMs, but it had no measurable impact.
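For completeness, this is the sort of thing we did; the stanza below is an illustrative grub.conf entry (kernel version and root device are placeholders, not the customer's actual values):

```
# /boot/grub/grub.conf -- versions/paths are placeholders
title CentOS (2.6.18-128.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/VolGroup00/LogVol00 divider=10
        initrd /initrd-2.6.18-128.el5.img
```

The divider=10 parameter drops the 1000 Hz timer tick of these RHEL 5 kernels to 100 Hz, which is commonly recommended for VMs; in our case it made no difference to the disk results.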