Just a little debugging on this first... There's not much you can do about NFS from the ESX side, only from the server side.
Run from the SC:
Are the settings proper for the connection NFS is using?
Also, you could look at /var/log/vmkernel and see if there are any errors with the data store access.
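If it helps, here's a quick way to pull the NFS-related messages out of that log from the service console (the grep keywords are just my guess at what's useful, adjust as needed):

```shell
# Show recent NFS/datastore-related messages from the VMkernel log.
grep -iE 'nfs|timeout|lock' /var/log/vmkernel | tail -n 50

# Or follow the log live while reproducing the problem:
tail -f /var/log/vmkernel
```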
Edward L. Haletky
VMware Communities User Moderator
Author of the book 'VMware ESX Server in the Enterprise: Planning and Securing Virtualization Servers', Copyright 2008 Pearson Education. CIO Virtualization Blog: http://www.cio.com/blog/index/topic/168354, as well as the Virtualization Wiki at http://www.astroarch.com/wiki/index.php/Virtualization
There are known performance issues with Windows NFS services and VMware ESX Server.
During my time with VMware Technical Support, we discovered a very large number of TCP packet retransmissions, timeouts, and missed ACKs with Windows Services for Unix. It didn't matter which version of the product (i.e., Windows Server with Services for Unix, Windows Storage Server, or Windows Unified Data Storage Server); we always saw the same issues.
Your best bet is to move away from Windows NFS services for now and use a decent Linux or BSD distribution. You'll find a tremendous improvement in overall performance and reliability this way.
Bugger, I was really hoping it wasn't going to be Services for Unix causing this. We had previously tried a Solaris NFS store and seen no problems, so assumed the Windows services would work just as well. Unfortunately we don't have the budget for two servers, and the Windows file shares have to take priority for now. I guess NFS will have to wait until next year.
If VMware Technical Support are aware of the issue, would it be worth me raising a case to see if they've gotten any information back from Microsoft about the cause of this?
Have you considered trying an iSCSI target on Windows?
I tried downloading a 3rd party iSCSI server, it seemed to work ok to begin with, but performance ground to a halt towards the end of the XP installation, and benchmark results ended up similar to NFS. Thanks for the suggestion, I'll keep testing iSCSI, but I really want to use NFS if possible.
I'll be testing with a different brand of NIC shortly on that server, but in the meantime I've used two spare workstations to create identical Solaris and W2k3 NFS servers, and on those the Windows server has faster write speeds. On a machine with a single hard disk and a 100Mb NIC I'm getting read speeds of 10MB/s and write speeds of 6.4MB/s. The Solaris server running on an identical machine is giving read speeds of 10MB/s and write speeds of 2.2MB/s.
So from that test, it appears W2k3 Server is capable of decent NFS performance, certainly comparable to a default Solaris install. However, I really shouldn't be getting the same write speeds from a workstation with a 100Mb NIC and a single IDE hard disk as I get from a dedicated server with a gigabit NIC and 15 SATA drives running off a battery-backed RAID controller with write caching enabled...
My current plans are to:
Fit gigabit NICs to the test workstations, and test Solaris and W2k3 NFS performance using those.
Fit the same brand of gigabit NIC to the original server and see if the performance changes.
If anybody has any other suggestions, please let me know. I'm a very experienced Windows admin, but VMware, Linux and Solaris are all pretty new to me, so any ideas, no matter how obvious they may seem, are welcome.
I'm also reading that NFSv3 now supports asynchronous writes and that this has a major effect on performance. Unfortunately I can't force async mode on the W2k3 server; does anybody know if it's possible to configure the NFS client within ESX to use async writes?
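For reference, on the Linux/Solaris side the sync/async behaviour is controlled on the server rather than by the client. A sketch of what that looks like on a Linux NFS server (the export path and subnet are placeholders):

```shell
# /etc/exports on a Linux NFS server. 'async' lets the server
# acknowledge writes before they hit disk (faster, but data can be
# lost if the server crashes mid-write). Path and network are
# placeholders.
/vmstore  192.168.0.0/24(rw,no_root_squash,async)

# Re-export after editing the file:
exportfs -ra
```

I haven't found an equivalent client-side knob on the W2k3 Server for NFS side.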
Are you accessing your NFS server the "normal" way, as a datastore? ESX 3.5 is buggy in this respect (VMware's response: we will fix it in... ESX 4.0!), and you get better performance using an NFS mount instead and doing your file operations through the console. Either way, though, you will see poor performance; something seems broken in NFS.
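Roughly, the two approaches look like this from the service console (the hostname, share path and datastore label below are placeholders):

```shell
# 1) The "normal" way: register the NFS export as an ESX datastore.
esxcfg-nas -a -o nfs-server.example.com -s /vmstore nfs-datastore

# 2) The workaround: mount the export directly in the service console
#    and do your file operations (cp, scp, etc.) from there.
mkdir -p /mnt/nfs
mount -t nfs nfs-server.example.com:/vmstore /mnt/nfs
```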
Wow, that's worth knowing. How do I configure NFS via the console and make it available for storing virtual machines on? I'm googling at the minute but haven't found anything yet.
Incidentally, I just tested performance on the workstation installation of W2k3, with a D-Link gigabit NIC, and I'm getting 46MB/s read and 27MB/s write performance. I'm about to fit one of those NICs to the original server now, but I'd still like to try mounting NFS differently.
Hmm, there's definitely something screwy going on here. I've fitted the D-Link gigabit NIC into the server, and it's still timing out when the virtual machine tries to format the disk.
Texiwill suggested I look at the vmkernel log. I really don't know what I'm looking for in there, so here's the log in all its glory: http://www.averysilly.com/vmkernel. This log was taken while running on an Intel gigabit NIC, straight after the virtual machine threw an error trying to delete a partition; a few minutes earlier I saw the storage device in the ESX console go inactive for a minute or two. I didn't spot exactly what happened, but it seems the server stopped responding to NFS requests while the XP client install was attempting to format the drive. I haven't seen the store go inactive since installing the D-Link NIC, but I'm looking out for that now.
It seems I may be looking for a hardware/configuration clash on the server, as opposed to a straightforward incompatibility between ESX/NFS and W2k3. My tests seem to show that Windows 2003 Server is fine as an NFS server and works well with a D-Link gigabit NIC. I no longer think I'm looking at an NFS problem, since the test on the workstation showed very good write performance, and my earlier iSCSI test on this server showed similar speed issues.
So it's probably not the NIC, W2k3 Server, or NFS. The server appears to perform fine over Windows shares, so whatever it is appears to be specific to ESX. I'm wondering if it's a problem with the motherboard or RAID card handling ESX requests.
About to try a standard hard disk and see what performance I get.
Quick update since I might have solved it. Will be about an hour before I get any benchmarks (waiting for XP to install), so I'll confirm the results then.
With a standard hard disk I was still getting timeouts, so that got me thinking about what else might be different between the server and the workstation I'd tested NFS on. Well, one major difference is that the server has two quad-core Xeon processors, so I decided to try forcing the NFS service to use just a single core. I set the "Server for NFS" service to log on using a network account, restarted it, then used Task Manager to force "nfssvc.exe" to use processor 0 only.
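For anyone wanting to script that instead of clicking through Task Manager, something like this should work from an elevated prompt on the W2k3 box, assuming PowerShell is installed (the process name is from above; the affinity value is a bitmask, so 1 means CPU 0 only):

```shell
REM Pin the Server for NFS process (nfssvc.exe) to processor 0.
REM ProcessorAffinity is a bitmask: 0x1 = CPU 0 only.
powershell -Command "(Get-Process nfssvc).ProcessorAffinity = 1"
```

Note the setting doesn't survive a restart of the service, so it would need re-applying each time (or from a startup script).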
Started the re-installation on the virtual clients and they flew through the formatting. The RAID card still appeared slower than the single drive, so I'll carry on testing that, but you could actually see the difference on the HDD activity lights. Whereas before the lights just flickered on briefly and you could see it walking through the drives during the format, now every I/O light is on solidly while XP formats the disk. I was concerned about CPU load, but I've not yet seen that processor go over 1% usage, so it appears a single core will be fine.
Nope, not that simple. Setting processor affinity has definitely made some improvement; by setting it I can ensure the format always completes, but it's not the final solution.
Today's testing makes me think this is a Microsoft problem though, not a VMware or NFS one.
It seems today that old IDE drives outperform any SATA ones when using Microsoft's NFS server. Both my RAID-6 array and a single SATA disk can only reach 4MB/s write speed. The old IDE disk from the workstation, however, produced around 27MB/s write performance, and managed the same when installed in the server!
So it seems there are two issues:
NFS can hang or become unresponsive when the service is allowed to execute on multiple processors
NFS write speed is horrendous on modern drives, and around 8x faster on an old IDE disk.
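For anyone wanting to reproduce this, a rough way to measure sequential write speed to an NFS export from the ESX service console is a simple dd against a console mount (the mount point below is a placeholder, and this isn't necessarily how the figures above were measured):

```shell
# Rough sequential write test against an NFS mount (placeholder path).
# Writes 256 MiB of zeroes; divide the size by the elapsed time
# for an approximate MB/s figure.
time dd if=/dev/zero of=/mnt/nfs/ddtest bs=1M count=256
sync
rm /mnt/nfs/ddtest
```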
I'm phoning Microsoft today to raise a tech support case for both issues. I'll update this thread with the results later on in case anybody else ever hits the same problem.
Did you get an update on this finally?
Did you finally fix the problem? Please let us know as we are having the same issues.
I too am having this issue. Anyone have a solution or fix? I would hate to have to fry my whole Windows Storage Server installation from HP and move to a Linux-based OS just for NFS.
I am having the same problems that you are; have you got an update on this issue yet?
Give that a try - it made a world of difference to us and it's made it usable. It seems that with NFS and Windows 2003, each NFS packet does an LDAP/SID lookup, and because root isn't typically in the mapped list, and only root is trying from an ESX system, each packet has to time out on its lookup first.
I was working on something else NFS/Linux/Windows-related, and was also hoping to get a single shared ISO/installs NFS datastore mount for our ESX servers. I found a number of people here in the forums with the same issue, but no real solution. So far I'm happy with this one, knock on wood.
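In case it helps anyone else: if the fix being referred to is adding a root user mapping on the Windows side, with Services for Unix that can be done from the command line with mapadmin. This is only a sketch (domain and account names are placeholders; check the SFU documentation for your version):

```shell
REM Map the UNIX root account to a Windows account so that NFS
REM requests arriving as root from ESX don't time out on SID lookup.
REM Domain and user names below are placeholders.
mapadmin add -wu MYDOMAIN\nfsadmin -uu root

REM Verify the mapping was added:
mapadmin list -all
```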