obstmassey
Enthusiast

Poor ESXi 4 NFS Datastore Performance with Various NAS Systems

Hello!

In testing, I have found that I/O performance inside a guest is between one half and one quarter as fast when the ESXi 4 host mounts the datastore via NFS as when the guest mounts the exact same NFS share directly. However, I do not see this effect if the datastore uses either iSCSI or local storage. This has been reproduced with different systems running ESXi 4 and different NAS systems.

My testing is very simple. I created a bare-minimum CentOS 5.4 installation (fully updated as of 2010/04/07) with VMware Tools loaded, and I time the creation of a 256MB file using dd. I create the file either on the root partition (a VMDK stored in various datastores) or in a directory from the NAS mounted via NFS directly inside the guest.
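For reference, here is the whole test wrapped in a small shell script (a sketch of my procedure; the NFS server address and export path are the ones from my lab and would need adjusting to reproduce it elsewhere):

#!/bin/bash
# Time a 256MB sequential write, then force dirty pages out to storage.
run_test() {
    sync; sync; sync
    time { dd if=/dev/zero of="$1/test.txt" bs=1M count=256; sync; sync; sync; }
    rm -f "$1/test.txt"
}

# 1. Write to the guest's root filesystem (the VMDK in whatever datastore is under test).
run_test /

# 2. Write to the same NFS export the datastore lives on, mounted directly inside the guest.
mount 172.28.19.16:/mnt/InternalRAID1/shares/VirtualMachines /mnt
run_test /mnt
umount /mnt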

My primary test configuration consists of a single test PC (Intel 3.0GHz Core 2 Duo E8400 CPU with a single Intel 82567LM-3 Gigabit NIC and 4GB RAM) running ESXi 4, connected to an HP ProCurve 1810-24G switch, which is in turn connected to a VIA EPIA-M700 NAS system running OpenFiler 2.3 with two 1.5TB 7200RPM SATA disks in software RAID 1 and dual bonded Gigabit Ethernet NICs. However, I have reproduced this with different ESXi PCs and different NAS systems.

Here is the output from one of the tests. In this case, the VMDKs are in a datastore on the NAS, mounted via NFS:

root@iridium /# sync; sync; sync; time { dd if=/dev/zero of=test.txt bs=1M count=256; sync; sync; sync; }
256+0 records in
256+0 records out
268435456 bytes (268 MB) copied, 0.524939 seconds, 511 MB/s
real 0m38.660s
user 0m0.000s
sys 0m0.566s
root@iridium /# mount 172.28.19.16:/mnt/InternalRAID1/shares/VirtualMachines /mnt
root@iridium /# cd /mnt
root@iridium mnt# sync; sync; sync; time { dd if=/dev/zero of=test.txt bs=1M count=256; sync; sync; sync; }
256+0 records in
256+0 records out
268435456 bytes (268 MB) copied, 8.69747 seconds, 30.9 MB/s
real 0m9.060s
user 0m0.001s
sys 0m0.659s
root@iridium mnt#

The first dd is to a VMDK stored in a datastore connected via NFS. The dd completes almost immediately, but the sync takes almost 40 seconds! That's less than 7MB per second of transfer rate: very slow. Then I mount the exact same NFS share that ESXi is using for the datastore directly into the guest and repeat the dd. As you can see, the dd takes longer but the sync takes no real time (as it should for an NFS share with sync enabled), and the entire process finishes in less than 10 seconds: it's four times faster!
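As an aside, the flush time can be folded into dd's own timing with conv=fsync, which as far as I know the coreutils dd shipped with CentOS 5 supports; it gives one number instead of a separate dd and sync:

# Same 256MB write, but dd calls fsync() before reporting, so the quoted
# MB/s already includes the time needed to flush the data to the datastore.
dd if=/dev/zero of=test.txt bs=1M count=256 conv=fsync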

I only see these results on datastores mounted via NFS. For example, here is a test run on the same guest running from a datastore mounted via iSCSI (using the exact same NAS):

root@iridium /# sync; sync; sync; time { dd if=/dev/zero of=test.txt bs=1M count=256; sync; sync; sync; }
256+0 records in
256+0 records out
268435456 bytes (268 MB) copied, 1.6913 seconds, 159 MB/s
real 0m7.745s
user 0m0.000s
sys 0m1.043s
root@iridium /# mount 172.28.19.16:/mnt/InternalRAID1/shares/VirtualMachines /mnt
root@iridium /# cd /mnt
root@iridium mnt# sync; sync; sync; time { dd if=/dev/zero of=test.txt bs=1M count=256; sync; sync; sync; }
256+0 records in
256+0 records out
268435456 bytes (268 MB) copied, 8.66534 seconds, 31.0 MB/s
real 0m9.081s
user 0m0.001s
sys 0m0.794s
root@iridium mnt#

And the same guest running from the internal SATA drive of the ESXi PC:

root@iridium /# sync; sync; sync; time { dd if=/dev/zero of=test.txt bs=1M count=256; sync; sync; sync; }
256+0 records in
256+0 records out
268435456 bytes (268 MB) copied, 6.77451 seconds, 39.6 MB/s
real 0m7.631s
user 0m0.002s
sys 0m0.751s
root@iridium /# mount 172.28.19.16:/mnt/InternalRAID1/shares/VirtualMachines /mnt
root@iridium /# cd /mnt
root@iridium mnt# sync; sync; sync; time { dd if=/dev/zero of=test.txt bs=1M count=256; sync; sync; sync; }
256+0 records in
256+0 records out
268435456 bytes (268 MB) copied, 8.90374 seconds, 30.1 MB/s
real 0m9.208s
user 0m0.001s
sys 0m0.329s
root@iridium mnt#

As you can see, the direct guest NFS performance in all three tests is very consistent. The iSCSI and local disk datastore performance are both slightly better than this, as I would expect. But the datastore mounted via NFS gets only a fraction of the performance of any of these. Obviously, something is wrong.

I have been able to reproduce this effect with an Iomega Ix4-200d as well. The difference is not as dramatic, but it is still sizeable and consistent. Here is a test from a CentOS guest using a VMDK stored in a datastore provided by an Ix4-200d via NFS:

root@palladium /# sync; sync; sync; time { dd if=/dev/zero of=test.txt bs=1M count=256; sync; sync; sync; }
256+0 records in
256+0 records out
268435456 bytes (268 MB) copied, 11.1253 seconds, 24.1 MB/s
real 0m18.350s
user 0m0.006s
sys 0m2.687s
root@palladium /# mount 172.20.19.1:/nfs/VirtualMachines /mnt
root@palladium /# cd /mnt
root@palladium mnt# sync; sync; sync; time { dd if=/dev/zero of=test.txt bs=1M count=256; sync; sync; sync; }
256+0 records in
256+0 records out
268435456 bytes (268 MB) copied, 9.91849 seconds, 27.1 MB/s
real 0m10.088s
user 0m0.002s
sys 0m2.147s
root@palladium mnt#

Once again, the direct NFS mount gives very consistent results. But using the disk provided by ESXi from an NFS-mounted datastore gives consistently worse results. They are not as terrible as the OpenFiler results, but the runs are consistently between 60% and 100% longer.

Why is this? From what I've read, NFS performance is supposed to be within a few percent of iSCSI performance, yet I'm seeing between 60% and 400% worse performance. And this is not a case of the NAS being unable to provide decent NFS performance: when I connect to the NAS via NFS directly inside the guest, I see dramatically better performance than when ESXi connects to the same NAS (the same share!) via NFS.

The ESXi configuration (e.g. network and network adapters) is 100% stock. There are no VLANs in place, etc., and the ESXi system only has a single Gigabit adapter. This is certainly not optimal, but it does not seem to me that it can explain why a virtualized guest gets so much better NFS performance than ESXi itself to the same NAS. After all, they are both using the exact same sub-optimal network setup...

Thank you very much for your help. I would appreciate any insight or advice you might be able to give me.

J1mbo
Virtuoso

It seems that your mind is made up. But try changing the scheduler on the NFS box to noop.
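For anyone wanting to try this quickly, the scheduler can be switched at runtime without a reboot (a sketch; /dev/sda stands in for whichever device backs the export on the NFS box):

# Show the current scheduler (the active one appears in square brackets)
cat /sys/block/sda/queue/scheduler
# Switch to noop for this device until the next reboot
echo noop > /sys/block/sda/queue/scheduler

Adding elevator=noop to the kernel command line makes the change permanent for all devices.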

As I mentioned above, the partition needs to be created properly aligned on the NFS server, but much more importantly the guest partition also needs to be 4K aligned.

There are some other tweaks too but I'm saving those for my book :)

Please award points to any useful answer.

obstmassey
Enthusiast

I'm willing to try, but probably not enough to buy your book! :)

Your changes are tweaks. They certainly might claw back some of the 15% or so performance drop I'm seeing with the right hardware, but they are not going to help with the 50% to 75% performance drop I'm seeing on commodity hardware!

To quote Michael Abrash, "Profile before you optimize." I'm not trying to wring out the last few percent of performance. I'm trying to figure out why performance with NFS datastores on commodity hardware falls into an abyss. noop ain't gonna fix that...

And for the record, here's the result with elevator=noop added to the kernel command line:

root@iridium /# sync; sync; sync; time { dd if=/dev/zero of=test.txt bs=1M count=256; sync; sync; sync; }
256+0 records in
256+0 records out
268435456 bytes (268 MB) copied, 23.0501 seconds, 11.6 MB/s
real 0m37.009s
user 0m0.000s
sys 0m0.781s
root@iridium /# mount 172.28.19.16:/mnt/InternalRAID1/shares/VirtualMachines /mnt
root@iridium /# cd /mnt
root@iridium mnt# sync; sync; sync; time { dd if=/dev/zero of=test.txt bs=1M count=256; sync; sync; sync; }
256+0 records in
256+0 records out
268435456 bytes (268 MB) copied, 8.87204 seconds, 30.3 MB/s
real 0m9.116s
user 0m0.000s
sys 0m0.349s
root@iridium mnt#

No meaningful difference.

J1mbo
Virtuoso

I never said the book would cost anything :)

BTW, how are you running dd? Is this from ESX console, or within a VM?

I just added a SATA drive to a Debian VM providing NFS shares to ESXi. Using IO-Meter in the Windows VM I get read or write at about 40 MB/s. 8K random with 2GB test file and 8 IOs outstanding gives about 115 IOPS at 66ms (70:30 read:write).

Deliberately misaligning the guest, read or write drops to about 27 MB/s and 80 IOPS at 100ms on the same basis.

All of this is on a quite full WD Caviar Green, so the numbers look OK to me.

Re the command scheduler: switching it from cfq to noop for the device that IO-Meter is ultimately running against (a SATA RAID-10 volume on a Perc-5) doubles the sequential throughput on my test box...

Please award points to any useful answer.

obstmassey
Enthusiast

These are stock ESXi systems: no console, and not using the "unsupported" console. The dd runs are all from inside a CentOS 5.4 guest. The first dd tests writing to the guest's local VMDK-based drive, and the second tests writing to the same NFS share that ESXi is using to provide the VMDK file. Additional details are all in the first post of this thread. What I'm trying to find out is why a guest gets four times the performance writing to an NFS share directly than an ESXi datastore on the same share gives the exact same guest through a disk image.

I don't know what you mean by "add[ing] a SATA drive to a Debian VM". How do you add a physical drive to a VM? Did you mean to a Debian NAS? If so, that's about the performance I would expect from a single SATA spindle. I can easily get more than that directly from my simple NAS (dual 1.5TB 7.2k SATA drives in RAID 1) via NFS, but not from within an ESXi datastore mounted via NFS.

Also, I really don't think that the problem is related to dd. What got me going in this direction was that my personal VM was running very slowly after I moved it from a datastore on a local SATA drive attached to an ESXi system to an OpenFiler NAS. I traced it down to terrible disk performance within the VM. The dd measurements parallel the results I was getting in the Windows VM, and dd makes for a really easy way to test NFS connections.

Do you mean Iometer from http://www.iometer.org ? I've got a Windows VM on this system: I'll give this a try and see if I can compare to the results you've gotten.

mike_laspina
Champion

Hi,

One thing you could explore is the possibility that your network adapter is the performance issue and not the disk storage subsystem.

I see this all too often: we don't check for shared IRQs with slow devices, or we don't validate end-to-end network performance.

I suggest you check your network throughput with iperf.
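A minimal way to run this (a sketch; substitute the NAS and guest addresses from your own setup):

# On the NAS (or whichever end will receive the traffic)
iperf -s
# On the CentOS guest, run a 10-second TCP throughput test against it
iperf -c 172.28.19.16

A result well below wire speed, or one that varies a lot between runs, would point at the NIC/IRQ side rather than the disks.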

http://blog.laspina.ca/ | vExpert 2009
obstmassey
Enthusiast

If I get a chance, I will test the network performance, but it is highly unlikely to be network performance for the following reasons:

  1. The guest gets correct performance connecting to NFS directly

  2. ESXi gets correct performance when connecting to iSCSI datastores

  3. ESXi gets mostly-correct performance when connecting to NFS shares backed by a RAID controller with BBC

To me, it seems very clearly tied to ESXi and NFS.

mike_laspina
Champion

I see.

That would be enough evidence to indicate your network is working normally; however, it does not eliminate poor latency, which is exactly what NFS O_SYNC writes suffer from. Small synchronous NFS I/O requests will not queue up the way they do on a block device such as iSCSI; each one demands immediate acknowledgment that the operation is complete. A 5ms network delay is an eternity when NFS is asking the storage provider to assure that a write is complete at the disk level.

Also, if the interface ring is saturated due to poor IRQ handling, the effect is compounded.

http://blog.laspina.ca/ | vExpert 2009
obstmassey
Enthusiast

Here are the results, with iperf -s running on the commodity OpenFiler box and iperf -c running on the Linux guest:

# iperf -c 172.28.19.16
------------------------------------------------------------
Client connecting to 172.28.19.16, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
local 172.28.16.100 port 38415 connected with 172.28.19.16 port 5001
Interval       Transfer    Bandwidth
0.0-10.0 sec   816 MBytes  684 Mbits/sec

Seems OK to me. Not great, but far from a problem. Are there further tests or other iperf parameters you would like me to try?

mike_laspina
Champion

Hi,

Yes, it's not a perfect result. There is some indication of latency when it reports 816 MBytes transferred but only 684 Mbits/s of bandwidth, which suggests there are some pauses in the Ethernet flow.

If latency were low, the two numbers would be a little closer given the standard 8b/10b encoding. The frame overhead would normally be about 5%, and you're reporting about 10%, so you have roughly a 5% variance.

This will impact NFS performance, but certainly not as much as you're experiencing, so there are other factors at play as well, like share buffers, CPU load, etc.

http://blog.laspina.ca/ | vExpert 2009
obstmassey
Enthusiast

I have a feeling that the poor network performance is from the NAS (a VIA-based system) on the other end of the iperf test, not the ESXi system. So, I reran the iperf test from the ESXi system against the IBM x236. Better results:

# iperf -c 172.28.19.17
------------------------------------------------------------
Client connecting to 172.28.19.17, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
local 172.28.16.100 port 41688 connected with 172.28.19.17 port 5001
Interval       Transfer     Bandwidth
0.0-10.0 sec   1.09 GBytes  935 Mbits/sec
#

Another strike against commodity (or in this case, embedded) hardware! :)

Next I'll try putting OpenFiler on an Intel-based machine with either an Intel or Broadcom NIC and see what the numbers are. I have a strong suspicion that we will see similarly-improved iperf numbers, but still see terrible NFS performance. If I do, I think we can put the nail in the coffin of any other excuse besides terrible NFS performance...

Unless someone else has any other bright ideas? I've come this far, and I've got the lab set up. If it isn't too difficult, I'll test it.

Frankly, at this point, I've virtually written NFS off. The only reason I was considering it was for simple full-VM backup. I've found that I can achieve the exact same thing with iSCSI using the (unsupported) ESXi console in the lab (and for production I can use the ESX console or VCB). Looks like iSCSI is the clear winner on commodity hardware.

J1mbo
Virtuoso

Install Debian on a reasonable machine with some decent drives. Set up carefully throughout, it will easily saturate GigE for read and write.

To me the biggest issue with NFS is the 4K allocation unit, which makes alignment of guest partitions absolutely critical.

Please award points to any useful answer.

obstmassey
Enthusiast

If by "reasonable machine" you mean something with a battery-backed cache, I'd agreemostly. If you mean you can do it without it, I have seen no evidence that it's possibleand I've tried.

J1mbo
Virtuoso

I agree, but that is a function of the ESX approach to storage in general. Even DAS with parity RAID but without BBWC is often cited on here at under 10MB/s sequential write. Single SATA drives tend to do rather better in this respect.

Please award points to any useful answer.

obstmassey
Enthusiast

So we all agree: Battery Backed Cache is important for ESXi in general, but essential for NFS in particular.

It would have been more useful if you had said that instead of asking me to change I/O schedulers and worry about alignment! :) I was very clear that my testing was on commodity hardware (specified as SATA with software RAID 1).

As long as you are using NFS without a BBC, performance will be a small fraction of what the hardware is capable of delivering. (Can I mark my own answer as correct / helpful? :) )

However, if anyone wants to suggest other ways BESIDES BBC to improve NFS performance with ESXi, I am all ears, and my test machines stand ready to go! :)

alubel
Contributor

IME, BB[W]C doesn't speed up O_SYNC writes, as that would be breaking the rules when mounting the datastore via NFS. The only way to speed up syncs is to use faster storage hardware. You can't expect a Volkswagen to perform like a Porsche!

Again, I suggest using vdbench or even IOzone locally and forcing direct I/O. If not, then maybe load a Windows VM and try using Iometer or ATTO Disk Benchmark.

When I get a chance I'll post my numbers for an NFS-attached VMDK, a local VMDK, as well as VM-attached NFS.

If you are trying to do NFS storage on the cheap, I suggest using ZFS-backed storage, where you can add a logzilla (a.k.a. writezilla) via one or two SSDs.
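For anyone unfamiliar with it, the idea is to give the ZFS intent log (which is what absorbs those synchronous NFS writes) a dedicated fast device. A minimal sketch on an OpenSolaris-style box, with hypothetical pool and device names:

# Create a mirrored pool from two SATA disks (device names are examples)
zpool create tank mirror c0t1d0 c0t2d0
# Add an SSD as a dedicated ZIL (log) device, so sync writes land on the SSD
zpool add tank log c0t3d0
# Create the filesystem and share it over NFS for ESXi to mount
zfs create tank/vmstore
zfs set sharenfs=on tank/vmstore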

J1mbo
Virtuoso

Sorry, but something is amiss here; BBWC does exactly that (speed up writes), since from the NFS server's perspective the storage hardware completes the write with near-zero latency, the controller's caching and coalescing converting individual block writes (read-update-write) into full-stripe writes. This should be faster than an SSD since it completes at DRAM speeds. To get the write performance from ESX, the export needs to be async too.
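To illustrate what I mean by the export being async, here is a sketch of a Linux /etc/exports entry (the path is the share used earlier in this thread, the subnet is a placeholder; note that async means the server acknowledges writes before they hit disk, so BBWC or a UPS you trust is the trade-off):

# Default, safe behaviour: every write is committed before it is acknowledged
/mnt/InternalRAID1/shares/VirtualMachines 172.28.16.0/24(rw,sync,no_root_squash)
# Async: the server replies as soon as the data is in its cache
/mnt/InternalRAID1/shares/VirtualMachines 172.28.16.0/24(rw,async,no_root_squash)

After editing, running exportfs -ra re-exports the shares with the new options.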

NFS tuning for a particular server is vital, but retesting this in some depth, I actually can't get much improvement in writes. Reads and IOPS are a different story: that Debian install runs at 60MB/s and 380 IOPS 'out of the box', rising to 115MB/s and 480 IOPS with some tuning :)

Please award points to any useful answer.

J1mbo
Virtuoso

By the way, re alignment, it really is important and I can't emphasise this enough. Consider what's happening with a mixed workload (I use 70:30 read:write for testing) running from a guest whose partition is misaligned relative to the NFS blocks.

As everyone agrees, ESX will complete writes serially. The guest asks to update an 8K block that straddles three 4K NFS blocks. ESX reads the three 4K blocks, updates them with the new 8K worth of data, and then writes back the three blocks. This obviously adds a massive penalty to writes (physical disk seek plus latency regardless of BBWC, since the read needs to be physically delivered before the write can commence), but it also completely blocks IO in the process. This effect can be seen on OpenFiler, for example, by running iostat from SSH and looking at avgqu-sz, which will be close to 1 during such testing. With a queue depth of only 1, the array will perform pretty much like a single disk running read-update-write, regardless of how many physical disks are present. Because of this, the impact gets relatively worse as the number of disks is increased; my own benchmarking against a 6x 15k SAS RAID-5 array (Perc-6i, 256MB BBWC, OpenFiler) showed 220 IOPS misaligned, rising to over 1200 with everything set up properly :)
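For reference, a quick way to check this inside a Linux guest (a sketch; /dev/sda and the 2048-sector start are illustrative, the point being that the partition's starting sector should divide evenly by 8 so it lands on a 4K boundary):

# Show partition start offsets in 512-byte sectors
fdisk -lu /dev/sda
# The old default start of sector 63 is NOT divisible by 8, so it is misaligned.
# When building a new guest, start the first partition at sector 2048 instead:
fdisk -u /dev/sda    # then create the partition with a start sector of 2048

Windows guests older than 2008/Vista have the same 63-sector default and need their partitions created aligned (e.g. with diskpart) at install time.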

Anyway, HTH

Please award points to any useful answer.

obstmassey
Enthusiast

Remind me again what the advantage of NFS is? :)

Seriously, the only advantages I saw with NFS were all on the low end: not as low as the pre-boxed NAS systems you get from a retail store, but lower than a typical entry-level EMC or NetApp setup. However, by the time you build a storage unit that will make ESXi happy, AND make sure that you properly align every last guest (and that means aligning it with the NFS block size, the filesystem block size, and the RAID array block size), you've eaten up a lot of the initial savings from avoiding a "real" storage array in the first place.

The only area I see being easier with NFS than with iSCSI is backup of VMs outside of VMware. That is really nice, but not that nice! And I can mostly get around that, even with iSCSI.

So, given the performance issues, and seeing as you can't actually use a low-end NFS-based NAS system for anything but the most trivial purposes, what are the advantages of NFS on anything besides a NetApp filer?

mike_laspina
Champion

Most definitely agreed! SSD/Cache is a must for NFS to perform effectively when used as ESX stores.

BTW: Sun S7000 boxes perform as well as, and with some loads better than, NetApp filers!

http://blog.laspina.ca/ | vExpert 2009
J1mbo
Virtuoso

The alignment issue will become less important, since Win2k8 and Vista+ align properly out of the box (not sure about Linux distros).

There are pros and cons on both sides. iSCSI has no delete command, so LUNs thin-provisioned at the storage layer can only ever grow. iSCSI will also generally cost money (even software solutions), whilst NFS is... well, free. On the other hand, iSCSI is somewhat simpler precisely because it is block level.

NFS is, IMO, a great solution for DR sites, where its scale-out capability and minimal cost (if redeploying hardware) can be a real benefit.

Please award points to any useful answer.
