VMware Cloud Community
zeevik
Contributor

Storage on NFS provides very poor performance to VMs

I have two ESX hosts (version 3.5 build 120512), with both local storage and NFS-based storage (a Linux server, 4 x 500GB RAID-5).

The VMs using the NFS storage are getting very poor performance.

For example, copying a 45MB file to the same folder:

A VM located on the ESX local storage finishes this task in 6.3 sec (45MB read + 45MB write / 6.3 sec ≈ 14MB/sec).

A VM located on the NFS-based storage finishes this task in 1 min 18 sec (45MB read + 45MB write / 78 sec ≈ 1.15MB/sec).

(same VM type, same OS, same resources)

This is not a problem with the NFS server itself: a physical PC connected to the server gets over 50MB/sec.

The ESX host itself (from its service console) completes the same task (copying the file from a folder on the NFS mount to the same folder) in 4.5 sec (≈20MB/sec).

Testing the network connection from the ESX host to the NFS server (using iperf) gives about 500Mb/sec (over 50MB/sec).
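
For reference, a test along these lines can be run with iperf (just a sketch; <nfs_server_ip> is a placeholder for the NFS server's address and the 30-second duration is only an example):

On the NFS server:

# iperf -s

On the ESX service console:

# iperf -c <nfs_server_ip> -t 30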

Sorry for the long story; I hope the problem is clear.

What could be the problem? How can I fix it?

What else should I check?

Thanks,

Zeevik.

RParker
Immortal

What is the NFS target? What are the disks that comprise the volume on NFS? And I assume that the physical machine points to the same NFS volume that you tested?

zeevik
Contributor

This is the NFS export configuration:

/NAS/VMFS 10.55.0.0/255.255.0.0(rw,no_root_squash,insecure,async,no_subtree_check)

All tests were done with files located in subfolders of the above export.

This is the ESX configuration:

# esxcfg-nas -l

QA-NAS-01 is /NAS/VMFS from 10.55.1.15 mounted
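
(For completeness, a datastore like this can be added from the service console roughly as follows; just a sketch using the label, host, and share shown above, so check the esxcfg-nas usage output for the exact syntax:)

# esxcfg-nas -a -o 10.55.1.15 -s /NAS/VMFS QA-NAS-01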

What do you mean by "What are the disks that comprise the volume on NFS"?

ges1234
Contributor

Did you ever find a solution for your problem?

I think I have the same.

mgroff3
Contributor

I believe I'm running into the same issue as well. Can we revive this thread?

-Marcus

dilidolo
Enthusiast

You may want to take a look at Solaris/OpenSolaris; it provides very good NFS performance and is very easy to manage. I use it as an NFS/CIFS/iSCSI/FC server.

I used Openfiler for my test lab, but I don't like the way it has to be configured. To me, OpenSolaris just makes more sense, as it is very close to the feature set NetApp has.

What NIC are you using? Your iperf result is very low; with an Intel/Broadcom NIC you should see over 900Mb/s.

mgroff3
Contributor

Thanks for the OpenSolaris suggestion. I'm a fan personally, but it doesn't speak to my particular issue. After some more research, I found the following thread, which might explain the performance issues, though it's not at all clear that this applies to *nix variants (and indeed, it seems to indicate that *nix clients should not be affected):

http://communities.vmware.com/message/997508

The last post from drummonds indicates that VMFS will commit NFS writes to disk prior to sending a TCP ACK; that will certainly cause a nasty performance issue. In my case, however, I'm not even using NFS as a VMkernel storage device. I'm mounting an NFS share (as a client) on a RHES4 ESX guest and seeing writes that take half again as long as they do on a standalone (i.e., not virtualized) RHES4 client. Same share, same server, same network topology. It seems clear that there is a serious issue with NFS performance in the ESX stack...
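
A quick way to compare raw NFS write throughput from the guest and from the physical client (just a sketch; /path/to/nfs/mount is a placeholder and the 64MB size is arbitrary) is to time a write followed by a sync, so the page cache doesn't hide the result:

# time sh -c 'dd if=/dev/zero of=/path/to/nfs/mount/ddtest bs=1024k count=64 && sync'
# rm /path/to/nfs/mount/ddtest

Running the same command on both clients gives a like-for-like comparison of sustained write speed.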

-Marcus

RParker
Immortal

Seems clear to me that there is a serious issue with NFS performance on the ESX stack...

You are missing one piece of crucial information: what type of disks are these, SATA or SAS? That is MOST important for I/O.

RAID type is also an issue, but the drives themselves have the BIGGEST impact on performance.

mgroff3
Contributor

RParker,

Thanks for the response. I'm not sure I follow, however... I'm not configuring a volume on the ESX host via NFS. I have an ESX guest and am mounting a file share (served from a NetApp) directly within the client. The problem is that the performance is much worse on the ESX guest than it is on an exact copy (of the ESX guest) built on standalone hardware. There is no network contention (or disk contention, for that matter) on the ESX host (note the storage for the VM is actually local SAS drives: 2 x 15K RPM 300GB in a RAID 1 config with a Dell PERC5 controller).
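
One thing that might be worth comparing (just a sketch of a diagnostic step) is the effective NFS mount options on the ESX guest versus the physical box, since differences in rsize/wsize or UDP vs. TCP can account for a large gap. On each client:

# nfsstat -m

If the options differ, remounting the share with matching options would make the comparison apples-to-apples.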

TIA,

Marcus

RParker
Immortal

the storage for the VM is actually local SAS drives: 2 x 15K RPM 300GB in a RAID 1 config with a Dell PERC5 controller

From that info, I don't think you are confused at all; that's exactly what I needed to know.

With that setup, you are realistically putting your VMs on one drive. There is your performance cap right there.

For one thing, a PERC controller is great for single-purpose jobs like a SQL server or a file server: you install a physical OS, run only bare-metal installs, and do no virtualization. PERC controllers are not really suited for VM hosting, especially with one drive (it's a mirror, but you only benefit from a mirror for READS; writes take twice as long as they normally would since they have to update two drives).

Your SAN is a NetApp, but what are the drives on the NetApp? How big is the aggregate, and if it's NFS, how many drives?

My suggestion would be to purchase four more drives just like them and make the entire array (all 6 disks) a RAID 10. You will have roughly 900GB of usable space (6 x 300GB, with half going to mirroring). If you want to increase the performance, that's how you do it. If not, there isn't much more you can do; the performance is what it is.

A SAN would be better than the local disks. The ONLY way to increase performance is to increase I/O, and you do that by adding disks. That's the solution.

mgroff3
Contributor

Umm... I think we're talking about different things. The disks in question (which, as you point out, amount to one logical volume) aren't really involved in the transaction per se. Insofar as they comprise the storage volume used for VMFS, they are involved, but the writes are occurring on the NetApp, so I don't see the relationship. More to the point, I'm not hitting a high-end performance cap; I'm seeing poor performance even for very meager demands. As an example, if I copy a ~60MB file from the ESX guest to the NFS share:

# time cp /home/build_rhodes_63_7.tgz /maven/Published/

real 0m9.134s

user 0m0.007s

sys 0m0.803s

The same file copy from the standalone host:

l3-mb2pub01:/# time cp /home/build_rhodes_63_7.tgz /maven/Published/

real 0m6.117s

user 0m0.004s

sys 0m0.696s

That is a drastic difference (roughly 50% longer on the VM), and I'm not even pushing the VM to produce it. My guess is that it is an issue with the ESX host's management of resources.

-Marcus

dilidolo
Enthusiast

We use NetApp NFS with ESX in production; the disks are 15K RPM FC disks, and we never have performance issues. In our dev environment the disks are SATA; it's a bit slow but still usable.

NFS is OK, but you need a good NFS server, fast disks, and a bit of tuning.

RParker
Immortal

I'm not hitting a high-end performance cap. I'm seeing poor performance even for very meager demands.

OK, what disks are on the NetApp? Are they FC (SAS) disks or SATA disks?

That will still make a difference in performance. What about a physical machine pointing to that same NFS store; is performance still slow?

That would rule out ESX as the culprit.

Also, are you doing this in a VM or at the service console? The service console in ESX is VERY limited and not a good test. VMs have a different access method than the SC uses, so you will see different results.

mgroff3
Contributor

Answers in-line below:

OK, what disks are on the NetApp? Are they FC (SAS) disks or SATA disks?

==> They are SATA drives

That will still make a difference in performance. What about a physical machine pointing to that same NFS store; is performance still slow?

==> That is exactly the point. A physical machine with the exact same config, pointed to the same NFS store on the same filer and plugged into the same switch, has write speeds that are an order of magnitude faster than those of the VM. The actual performance numbers are arbitrary; the item of interest is the discrepancy between the VM and the physical machine.

That would rule out ESX as the culprit.

==> That is exactly what points to ESX being the culprit.

Also, are you doing this in a VM or at the service console? The service console in ESX is VERY limited and not a good test. VMs have a different access method than the SC uses, so you will see different results.

==> I'm doing this from the VM, not the SC. If you look at my previous posts, I've been using the following nomenclature: ESX guest == VM; ESX host == SC.

-Marcus

Rumple
Virtuoso

Do you have dedup turned on for the SATA NFS volume? There is a bug in NetApp filers below a specific code level (I forget which one specifically) that will cause horrendous read speeds when using dedup. I hit this in our DR environment; it made the system unusable, but dedup in prod with the same code on a much larger FC aggregate never showed any issues. We turned off dedup and everything performed fine.
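
If it helps, on a 7-mode filer the dedup state can usually be checked and disabled along these lines (just a sketch; vol_name is a placeholder and the exact commands depend on the Data ONTAP version):

filer> sis status /vol/vol_name
filer> sis off /vol/vol_name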

I haven't done any real benchmarking of the NFS volume I had, which was 50 VMs on a single aggregate across a tray of SAS disks, but I never saw any performance issues either.

I would think that a single physical system hitting the NFS mount and a single ESX host with a single VM accessing the NFS mount would come in fairly close in performance (personally, I'd expect a 10-20% lower number out of ESX just because of the virtualization of the network traffic). ESX 4 is supposed to be much, much better than ESX 3.5.

Have you tried running Iometer in a single VM against the NFS target, and then running Iometer in multiple VMs at the same time, to see whether the performance limit is per VM or whether two VMs each run slower as they hit the bandwidth limit (while watching network utilization)?
