VMware Cloud Community
ehall
Enthusiast

NFS performance

On my test setup, NFS beats iSCSI by about 10% but it's still not as fast as the back-end infrastructure allows. Iozone with a large test file shows that the local RAID array on the storage server is able to sustain >950 Mb/s of writes and >2.5 Gb/s of reads (all numbers are bits not bytes), while TTCP tests show that the ESXi host and the Linux storage server can push >980 Mb/s of network traffic each direction (they are next to each other in the rack, with a crossover cable connecting unrouted dedicated interfaces for storage traffic).

Using Iozone with somewhat smaller test files (2x the VM memory), openSUSE VMs with their VMDKs on the NFS volume are able to sustain 400 Mb/s writes and 560 Mb/s reads. That's pretty good, but it's only half of what the infrastructure can deliver.
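For reference, the sequential tests are along these lines (the record size, file size, and path here are illustrative placeholders, not necessarily the exact flags used):

    # sequential write (-i 0) and read (-i 1) against the guest's local disk
    iozone -i 0 -i 1 -r 64k -s 4g -f /testdisk/iozone.tmp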

Worse is that XP/SP3 VMs with the VMDK on the NFS volume are only able to sustain ~240 Mb/s on writes and 420 Mb/s on reads, or about half of what the Linux VMs get. If I load up 4 of the XP VMs and run the Iozone tests simultaneously, overall throughput only goes back up to the Linux level.

It would seem that I am hitting some kind of limit here. My feeling is that something with the NFS session is preventing better performance, but I'm not sure where to begin looking. I am able to run Iozone from the ESXi console against the NFS store, but the patterns are very odd and do not jibe with the guest performance data, so I'm not sure what's going on there. I am doing more tests before publishing the numbers. Any ideas here? It's not network bandwidth or latency--I'm able to saturate the wire and ping times are 0.3 ms (300 microseconds).

Also, are there any tricks for improving the XP VMDK performance on NFS? I would like to get that closer to par with the Linux boxes.

Thanks

0 Kudos
42 Replies
AWo
Immortal

Is the NFS export set to sync or async? Async is much faster but not as safe for writes (data can be lost if the server goes down before it commits).


AWo

VCP 3 & 4

\[:o]===\[o:]

=Would you like to have this posting as a ringtone on your cell phone?=

=Send "Posting" to 911 for only $999999,99!=

vExpert 2009/10/11 = Save forests! rent firewood! =
0 Kudos
ehall
Enthusiast

The export is tweaked for performance with "rw,no_root_squash,no_subtree_check,async"
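For reference, the corresponding /etc/exports entry looks something like this (the path and client address are placeholders), followed by exportfs -ra to re-export:

    /srv/vmstore   10.0.0.2(rw,no_root_squash,no_subtree_check,async)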

I got curious and mounted the datastore export from inside one of the Linux guests using the vmnic/vswitch data interface, then ran the Iozone tests against that mount point (as opposed to testing the "local drive" performance through the vmkernel's NFS mount). Writes get 640 Mb/s and reads saturate the wire at 960 Mb/s. This is with no additional tweaking.
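In other words, roughly this from inside the guest (the server address, export path, and mount options are placeholders, not the exact command):

    # guest-side NFS mount, bypassing the vmkernel datastore path
    mount -t nfs -o tcp,rsize=32768,wsize=32768 10.0.0.1:/srv/vmstore /mnt/nfstest
    iozone -i 0 -i 1 -r 64k -s 4g -f /mnt/nfstest/iozone.tmp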

Searching for other posts on this topic, I see that I am one of hundreds with this problem. I think at this point it is pretty much proven that the vmkernel has some problems and I am unlikely to get any better numbers. What's interesting is that iSCSI performance is also choked down, so it's not just a problem with the NFS implementation but instead appears to be some kind of datastore transport limitation.

0 Kudos
RParker
Immortal

so it's not just a problem with the NFS implementation but instead appears to be some kind of datastore transport limitation.

I have been saying this for more than 3 years now... VMware MUST be limiting bandwidth somewhere.

0 Kudos
malaysiavm
Expert

I am waiting for them to support NFS v4.

Craig

vExpert 2009 & 2010, NetApp NCIE, NCDA 8.0.1

Malaysia VMware Communities - http://www.malaysiavm.com
0 Kudos
RParker
Immortal

As a Windows tech going way back to the early 90's, I have seen many an OS come and go. I have seen plenty of them flash brightly and then die on the vine for no clear reason; they seem to get stalled at some point:

OS/2

NextWave

BeOS (yes it may still be here, but not developed)

And a few others I can't remember.

The point is, only one stands out, and I really hate to say it (or prove it) but it appears to be true even NOW.

Windows has stood the test of time. Case in point: NFS. Windows supports NFS v4 and can host NFS data, and we don't see this performance degradation on Hyper-V VMs running even unsupported Red Hat (Hyper-V officially supports SUSE).

So where are we now? People bash Microsoft for many things, and you can say whatever you want, but even Linux has its shortcomings, and once they get a stable OS they forget the rest of it and don't care whether it's completely done.

Windows has never stalled. Maybe it hasn't always been great and has had a few black eyes, but that didn't stop MS from getting better. Windows Me and Windows Vista (maybe DOS 4) were horrible OSes, but Windows 7 and XP have been the most stable and powerful to date. So when these things don't work in an enterprise product like ESX, I have to question what they are thinking.

Microsoft is just looking for ANY excuse to eat their lunch. Apparently VMware is content with that, because like you said, NFS v4 isn't supported, and why not? It's been out a while. Why are we stuck with 2TB LUN limits? Windows isn't. Yes, it's a per-LUN limit, but ESX also limits the number of 2TB LUNs it can have. WHY?!? Windows has very high limits, well beyond what ESX can handle, so what I don't understand is: what gives? I have to believe they are happy (or complacent) with their CURRENT virtualization ranking, because it won't last much longer if this keeps up.

I am not an Apple fan (not against it either), but many people complain that there is no VI client for the Mac. I am not for or against that argument, but even Microsoft ships Apple software; what does that tell you? MS seems to listen to their customers, while VMware is concerned only with their own agenda. It's becoming more and more clear.

0 Kudos
wobbe98
Enthusiast

Perhaps your partitions are not properly aligned.

http://www.vmware.com/pdf/esx3_partition_align.pdf

Or perhaps there is a difference between the Windows and the Linux versions of iozone.

ehall
Enthusiast

The data is on a RAID-10 array that was wholly assigned to LVM. All data is in zero-aligned 4MB blocks and there are no partitions to mis-align. Linux and Windows VMDK files are read from the same directory tree in the same logical volume on that array.
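For anyone checking the same thing, something like this confirms where the first LVM extent starts (the VG/LV names are placeholders):

    # first physical extent offset, in 512-byte sectors
    pvs -o +pe_start --units s
    lvdisplay /dev/vg_storage/lv_vmstore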

It's possible that there are differences in the iozone builds (or maybe an error in the Windows port), however the tests don't use Windows application space so much as they report on local disk performance (files are written and read, and times are recorded), so it seems unlikely. But it's certainly possible that Iozone under Windows is throwing away half the requests or something like that.

Right now I am exploring different SCSI drivers to see about differences there. I thought maybe the default disk I/O parameters might be causing problems or that there would be some well-known performance tweaks. Thanks for the reply.

0 Kudos
J1mbo
Virtuoso

I just wanted to comment on RParker's post.

I really agree with the points about VMware being blinkered by their success. The 2TB limit in particular is a complete joke.

However, Microsoft themselves seem to be on a self-destruct mission too... all will change in the next decade, I feel!

http://blog.peacon.co.uk

Please award points to any useful answer.

Unofficial List of USB Passthrough Working Devices

0 Kudos
ehall
Enthusiast

Two changes with some results.

First, I rebuilt the drives/VMDKs using LSI Parallel controllers, which seems to have helped a great deal (they were originally imported from VMware Server and had been whacked at quite a bit). I also tested with the Paravirtual SCSI controller, and while it yielded consistently better data it wasn't a huge increase (maybe 2-5%), which does not justify the extra difficulty in managing the systems.

Second, I read through the link from wobbe98, which advised aligning the guest partitions too, but that did not seem to make any statistical difference except that cached operations fell a bit (probably because fewer underlying blocks and stripes are being processed). I'll have to look into that more. It may be that I can improve things by boosting write-ahead, or by using larger blocks, which would mimic some of the earlier sloppy behavior.

For the Linux VMs, local disk performance increased by 60% just by rebuilding the drives (2.6 kernel drivers). I'm currently pushing 650 Mb/s on writes and 900 Mb/s on reads, which is pretty good. This puts the write performance of the vmkernel NFS above the raw NFS mount write numbers, and close to the same on the read numbers. I'm still missing about 30% of the write capacity but at this point I'm much less worried about it.

The Windows VMs also improved from recreating the drives (LSI provided drivers), but only by 15-20% and that is only if you look at the data cross-eyed. Write performance stabilized but did not go much higher than it was originally, while the read performance did improve noticeably in some areas. The performance is clearly better but it's still bad, and still below the original Linux numbers. I bet if I bumped up cluster size to 64k it would jump, but I'd like to know what's holding it back right now.

Keep the ideas coming

0 Kudos
J1mbo
Virtuoso

The big problem I see here is that the testing methodology is looking only at sequential throughput.

I've been using NFS quite a bit and have found that the underlying NFS server configuration is hugely important. File system choice is also critical in some configurations, as are the partition (or volume) mount options. In general I tend to mount with noatime,nodiratime plus fs-specific options such as data=writeback, nobarrier etc. (but my machines have both BBWC and a UPS). When formatting, ext3/4 can be tuned to the underlying RAID volume, and for XFS you can specify bigger log buffers.
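As a rough sketch of what I mean (the device, mount point, and RAID geometry below are placeholders, not a recommendation for any particular array):

    # /etc/fstab entry with the mount options above
    /dev/vg0/vmstore  /srv/vmstore  ext4  noatime,nodiratime,data=writeback,nobarrier  0  2

    # ext4 tuned to a hypothetical RAID-10 with 64k chunks and 4 data disks:
    # stride = 64k chunk / 4k block = 16, stripe-width = 16 x 4 = 64
    mkfs.ext4 -b 4096 -E stride=16,stripe-width=64 /dev/vg0/vmstore

    # bigger XFS log buffers are a mount-time option
    mount -o noatime,nodiratime,logbufs=8,logbsize=256k /dev/vg0/vmstore /srv/vmstore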

The NFS server can also be tweaked, for example providing enough threads, 256k window sizes, and setting the IO scheduler depending on the hardware (use noop for RAID controllers, as these will do the re-ordering themselves).
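For example (the device name is a placeholder, and the distro-specific config file varies):

    # more nfsd threads (or set RPCNFSDCOUNT in /etc/default/nfs-kernel-server / /etc/sysconfig/nfs)
    echo 64 > /proc/fs/nfsd/threads

    # allow the 256k socket buffers mentioned above
    sysctl -w net.core.rmem_max=262144
    sysctl -w net.core.wmem_max=262144

    # noop elevator on the hardware RAID device
    echo noop > /sys/block/sdb/queue/scheduler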

However, IMO the testing needs to be focused on random workloads, for which I use IOmeter. Testing 8K random, say 70% read, over a good-sized test file to avoid cache (maybe 8GB, maybe 30GB depending) with various queue depths will show (possibly profoundly) the effect of guest partition alignment. Set the access pattern to be 4KB aligned (presumably your guest file systems are using 4KB blocks) and hit these with moderate queue depths (say 16 outstanding IOs) against aligned and unaligned partitions.
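On the Linux side a similar workload can be approximated with fio instead of IOmeter (fio is my substitution here, and the file path and sizes are placeholders):

    fio --name=randtest --filename=/mnt/test/fio.dat --size=8g \
        --rw=randrw --rwmixread=70 --bs=8k --blockalign=4k \
        --ioengine=libaio --direct=1 --iodepth=16 \
        --runtime=120 --time_based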

Another issue I've come up against recently is the interaction between file systems and software RAID (i.e. mdadm). Although I usually use XFS, it had truly awful random performance specifically when running on an mdadm array, for reasons I don't yet understand, whilst JFS worked well but with sequential write workloads the jfs commit thread used progressively more and more CPU, eventually bringing performance to its knees (<10MB/s). Ext4 has proved immune to both these problems but takes too long to delete files approaching 2TB, causing timeouts in ESX - nothing, it seems, is perfect!

As it happens, sequential write speed for vmkernel-type operations (for example, copying a VMDK from local storage to NFS) seems to top out at about 60MB/s for me. Writing 32k sequential in a guest with IOmeter can saturate the link, however, with sufficient queue depth.

Anyway, a bit off topic, but HTH

http://blog.peacon.co.uk

Please award points to any useful answer.

Unofficial List of USB Passthrough Working Devices

ehall
Enthusiast

J1mbo, my guests aren't email servers or database servers; they are clients that are used for various kinds of profiling tests. I need to know the limitations of the infrastructure to do that work. I haven't found them yet, because whatever I'm bumping into is clearly too low, especially in comparison to the numbers that are obtainable from other tests (such as local I/O and network throughput). Thanks for the thoughts though.

0 Kudos
ehall
Enthusiast

I captured some of the NFS traffic between ESXi and the storage server during iozone operations, and the most obvious difference is that the XP writes cause ESXi to sync after every 65k of data, while the Linux guest only causes ESXi to sync after 512k. In both cases, the outgoing TCP segments are 4k in length.

I decided to simplify things a bit and copied a 4GB ISO file to the local drives, then used "cat dvd.iso > /dev/null" and "type dvd.iso > NUL" respectively to force the large data file to be read from the VMDK, and captured 1000 packets of NFS traffic from each. What that shows is that the Linux guest issues multiple parallel reads for 64k of data, which the NFS server provides to ESXi in VERY large segments (sometimes as much as 57k per segment!). On the other hand, the Windows guest issues single (synchronous) requests for 64k of file data, which the NFS server provides to ESXi in sequences of 4k segments. Remember this is the same NFS client and server (just different guests), so the difference in NFS behavior may be an important clue--perhaps thread handling is different for multiple parallel requests versus one single request.
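For anyone who wants to reproduce the captures, something along these lines on the storage network interface is all it takes (the interface name and ESXi address are placeholders):

    tcpdump -i eth1 -s 0 -c 1000 -w nfs-read-test.pcap host 10.0.0.2 and port 2049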

So just with these two data points, it seems that I should look into increasing the size of the writes and the number of outstanding reads for XP.

I am beginning to suspect that some of this is due to XP itself. Large servers (like for Exchange or SQL Server) will have much larger cluster sizes by default, which would be different from the 4k clusters that are default for this small guest. Multiple outstanding I/O requests should also improve the read results assuming they are interpreted that way by the filesystem driver.

0 Kudos
ehall
Enthusiast

I was able to recover the performance that was lost when partitions were aligned (loss of caching) by bumping the read_ahead_kb option on the RAID volume to 1024, which is recommended for database-like I/O anyway. That does not really "fix" the XP VM; it boosted all of the VMs by a few percentage points, so the improvements to XP are incidental.
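For reference, the change is a one-liner (the device name is a placeholder, and it does not persist across reboots unless scripted):

    echo 1024 > /sys/block/sdb/queue/read_ahead_kb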

http://www.eric-a-hall.com/dumpster/benchmarks/XP-SP3-VM-RA-KBs.png

http://www.eric-a-hall.com/dumpster/benchmarks/XP-SP3-VM-RA-IOPS.png

I also experimented with cluster sizes on the VM a little bit, but did not get much out of it. I was able to increase performance by 0-7% on two thirds of the transactions by bumping the cluster size to 16K, but the other third of the transactions decreased by 0-7%... not really a wash, but not good enough to put up with the problems that come with non-standard cluster sizes (I was unable to boot the VM with other sizes). It may be that if I combined 16K clusters with the paravirtual driver I would improve Windows performance by 10% total, but I wouldn't be able to work on the partition with any tools whatsoever.

I thought I'd give Win7 a try, and it is much better than XP but still not competitive with Linux. On Win7 the ESXi client is able to write 512KB before forcing a sync (same as Linux), but it does not produce the same data rate, so it is still not as fast. On reads it now uses 2 threads for linear reads, and the server will periodically use large messages in the replies, which is better than XP's single read thread, but still not on par with Linux and its multiple threads. Win7 will ramp up to ~440 Mb/s writes and ~700 Mb/s reads, but Linux pretty much starts off higher than that.

http://www.eric-a-hall.com/dumpster/benchmarks/Win7-VM-KBs.png

http://www.eric-a-hall.com/dumpster/benchmarks/Win7-VM-IOPS.png

The disk I/O in the Windows clients is the bottleneck. Just to prove the point I did some tests with a file size of 10MB (small enough to easily cache in VM memory) and performance was 2-3 Gb/s. I tested incrementally larger sizes, and basically anything that has to go to the client disk kills performance, even if the NFS server has the dataset completely cached in its memory. Basically the disk I/O on Windows clients is crap. I am almost interested enough to test whether the server flavors are implemented any differently, but that is beyond the scope of my current project.

0 Kudos
ehall
Enthusiast

Still nailing down some loose items here. The LSI SAS controller uses a storport Windows driver instead of a scsiport driver, so in theory it should be much faster, and it is for Win7 clients (and Linux too), but it is a couple of points slower on XP. Also weird is that Server 2003 R2 does not produce any better throughput than XP, even though it uses the storport driver model. Final best numbers came from Win7 with the bundled LSI SAS driver (which is not shown as storport), with average throughput of 400 Mb/s on writes and 528 Mb/s on reads, and peak throughput of 480 Mb/s on writes and 712 Mb/s on reads. IOPS were roughly 11k with 4k blocks.

On the other hand, with all of the other optimizations in place, and with the LSI SAS kernel module compiled into the initrd, the Linux VM is showing AVERAGE writes of 848 Mb/s and reads of 928 Mb/s, and peak writes of 904 Mb/s and reads of 968 Mb/s. These are using iozone's data throughput numbers, and with the NFS/TCP/IP overhead included I am bouncing off the wire limit. IOPS are pushing 27-28k for 4k blocks.

So basically, after everything, Linux performance is still roughly 2x average Windows performance. Taking everything into consideration (small number of threads, lack of benefit from the storport driver model, etc.), I suspect that there is a problem with the interaction between Windows disk I/O and the VMware storage subsystem. Clearly the ESXi NFS client is able to push the traffic... there is something peculiar with the Windows path in particular.

ps--As an aside, I also experimented with jumbo frame sizes, and the best numbers come from 4k frames (4136 MTU, which leaves 4096 bytes of payload after 40 bytes of header overhead). 8k frames increase the maximum application-layer throughput due to the reduction in overhead, but cache operations drop and the final number of IOPS is lower by about 5%.
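For reference, the MTU changes were along these lines (the vSwitch and interface names are placeholders; on 4.x the VMkernel port may need to be recreated with the matching MTU):

    esxcfg-vswitch -m 4136 vSwitch1       # on the ESXi host
    ip link set dev eth1 mtu 4136         # on the Linux storage server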

0 Kudos
J1mbo
Virtuoso

This is great info, please keep it coming :)

http://blog.peacon.co.uk

Please award points to any useful answer.

Unofficial List of USB Passthrough Working Devices

0 Kudos
ehall
Enthusiast

I'm glad somebody is finding this info useful. Unfortunately I'm out of ideas and am now catching myself testing the same things all over again.

Is anyone here seeing any Windows guests get near gigabit wire speeds on "local" disk I/O with the VMDK mounted over NFS (and without excess client-side caching)?

0 Kudos
LucasAlbers
Expert

I was looking at some of the tweaks that have been done by the VMmark winners; you might get some ideas from the VMmark results.

For example, in this particular one I found a few that tweak network and disk I/O.

http://www.vmware.com/files/pdf/vmmark/VMmark-Dell-2010-09-21-R715.pdf

Disk.SchedNumReqOutstanding=256 (default 32)

Net.MaxNetifRxQueueLen=300 (default 100)

Net.MaxNetifTxQueueLen=1000 (default 500)

Net.vmxnetThroughputWeight=255 (default 0)

BufferCache.SoftMaxDirty=85 (default 15)

Net.TcpipHeapMax=120 (default 64)

There are a number of settings you can tweak for disk caching and network buffer size; adjusting these might affect both, or push one closer to an optimal configuration, so I'm not sure which way to adjust them.
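If you want to experiment, these can be set from the console with esxcfg-advcfg (or via Advanced Settings in the vSphere client); for example:

    esxcfg-advcfg -s 256 /Disk/SchedNumReqOutstanding
    esxcfg-advcfg -s 120 /Net/TcpipHeapMax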

You could also switch the data file system from NTFS to exFAT, which Windows 7 supports.

0 Kudos
J1mbo
Virtuoso

Could you post your iozone command line so the tests can be replicated?

http://blog.peacon.co.uk

Please award points to any useful answer.

Unofficial List of USB Passthrough Working Devices

0 Kudos
ehall
Enthusiast

There are a number of settings you can tweak for disk caching and network buffer size; adjusting these might affect both, or push one closer to an optimal configuration, so I'm not sure which way to adjust them.

This is a good suggestion, however all of these tweaks are for overall performance, which isn't my problem (at my scale anyway).

It would probably be a good idea to investigate tweaking the individual *.vmx files. I'll have to drum up some guides.

0 Kudos