My VM Host is VMware ESXi 5.5 U3 (3029944)
As far as I know, VMware ESXi 5.5 U1 has a known problem with intermittent NFS APDs. However, even though my host is on U3, I am still seeing intermittent NFS APDs.
What should I do? It affects my system quite often.
What type of storage are you running? If it is by chance NetApp, be sure to check out NetApp Knowledgebase - How to troubleshoot NFS APD (All-Paths-Down) issues on VMware ESXi - I was hit with that when I ran NetApp.
Do you have multiple hosts, and if yes, are they all affected?
Do you have multiple datastores on the storage, and if yes, are they all affected?
How frequently are you seeing these (hourly, daily, weekly, etc.)?
Do you have LAGs set up to the storage?
Hopefully we'll be able to help you out on this.
I too am seeing this problem.
Using CentOS 6 NFS.
It was fine while we were on 5.1, but we recently upgraded to 5.5 U3 (3343343) and the problem started.
The underlying server never experienced an APD over the last 2 years; they only started after the upgrade.
I have set MaxQueueDepth to 64, and the MTU is the same across the board.
We have two NFS mounts from the same server and it only happens on the primary one which has active servers.
It does not occur during the day, but normally at night when backups are running. It does not seem related to the underlying NFS storage, as that is still available to all other servers when the APD happens on a node.
It seems to happen at the end of the backup, when the system is consolidating the VM it was working on.
The load on the underlying NFS server is low and I am not seeing major increased latency on other guest machines while the backup is happening, nothing that would not be expected during the working day.
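To back up the "only at night, during backups" observation with numbers, one option is to bucket the APD-related log lines by hour of day. Below is a minimal Python sketch, assuming the log lines start with an ISO-8601 timestamp (as ESXi logs normally do) and that simply matching the substring "APD" is good enough for a first pass. The function name and the sample lines are made up for illustration; real vmkernel.log / vobd.log messages are worded differently.

```python
import re
from collections import Counter

# Assumption: each log line begins with an ISO-8601 timestamp,
# e.g. 2016-03-01T02:14:07.123Z (ESXi logs are normally in UTC).
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2})T(\d{2}):\d{2}:\d{2}")


def apd_events_per_hour(lines, marker="APD"):
    """Count log lines containing `marker`, grouped by hour of day."""
    counts = Counter()
    for line in lines:
        if marker not in line:
            continue
        match = TIMESTAMP.match(line)
        if match:
            counts[int(match.group(2))] += 1
    return counts


# Hypothetical sample lines, for illustration only.
sample = [
    "2016-03-01T02:14:07.123Z example: APD start for datastore-A",
    "2016-03-01T02:14:40.456Z example: APD exit for datastore-A",
    "2016-03-01T14:00:00.000Z example: unrelated message",
    "2016-03-02T02:31:55.789Z example: APD start for datastore-A",
]

for hour, count in sorted(apd_events_per_hour(sample).items()):
    print(f"{hour:02d}:00  {count} APD-related line(s)")
```

To run it against real data, pass in the lines of a log file copied off the host instead of `sample`. If the counts cluster around the backup window, that at least confirms the correlation before digging further.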
Was troubleshooting a similar problem earlier this week.
NFS (v3 on ESXi 6.0) was working just fine up until the point where the IP subnet was changed for NFS traffic. Initially the host's vmk adapter and the storage array were configured with DHCP addresses, and the decision was made to move to a separate, new, isolated network. IPs were changed to static on the array and the vmk port, and the NFS datastore was removed from the host and added back using the new array IP. After that, VM power operations took a few tens of seconds, and opening a VM console in the C# client resulted in an error (which I can't remember, unfortunately). All these activities (power ops and console openings) were accompanied by a short, transient (10-30 seconds) APD for the NFS datastore from the host's perspective.
Tried quite a lot of things - looking for duplicate IPs, queue depths, permissions / settings on the array with no luck. Then stumbled upon this post: Problems while adding an NFS share to vSphere - PlanetVM and decided to give it a try.
It turned out that the array needs to be able to resolve the host's FQDN in order to serve NFS requests correctly. The new subnet did not have anything in terms of DNS/DHCP/GW; it was just a plain new non-routed network with 2 nodes connected (host and array). Once we added a record for the ESXi host to the array's hosts file, everything worked like a charm.
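If the storage side is a general-purpose box where you can run Python (for example a Linux NFS server rather than a closed array), a quick way to check this is to ask the system resolver what the host's IP maps back to. This is only a sketch: `reverse_lookup` is a hypothetical helper name, and it only tells you what the machine it runs on can resolve, so it has to be run on the NFS server itself, or on something using the same hosts file and DNS settings.

```python
import socket
import sys


def reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return the name that `ip` resolves to, or None if the lookup fails.

    `resolver` is injectable so the function can be exercised without
    touching the network; by default the system resolver is used.
    """
    try:
        hostname, _aliases, _addresses = resolver(ip)
        return hostname
    except (socket.herror, socket.gaierror):
        return None


if __name__ == "__main__":
    # Usage: pass the ESXi vmk IP address(es) as arguments.
    for ip in sys.argv[1:]:
        name = reverse_lookup(ip)
        print(f"{ip} -> {name if name else 'NOT RESOLVABLE'}")
```

A result of "NOT RESOLVABLE" for the host's NFS vmk address would point to the same kind of missing DNS or hosts-file entry described above.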
No it wasn't - I had this issue a few years ago and it was related to the vSphere version (can't recall if it was 5.1 or 5.5) and the version of OnTap that my NetApp was running. (VMware KB: NFS connectivity issues on NetApp NFS filers on ESXi 5.x/6.0).
Good to know that the DNS resolution could be the culprit too, even when IPs are used.
I have the same issue with ESXi 5.5 U3 as well. After upgrading from 5.1 to 5.5 Build 4179633 (a.k.a. Patch 8), we experience random NFS APDs, but only on our highest I/O cluster (all running Microsoft SQL VMs on Windows 2008 R2). Name resolution from the VNX to the ESX FQDN is okay (confirmed this even though we use IP addresses to mount the volumes). We have SIOC on and NFS Max Queue Depth set to 64, and are running the latest ixgbe async driver for our 10 Gb networking.
Note: The issue happens with or without SIOC or queue depth tweaks. It also happens regardless of using the inbox or async ixgbe driver.
When the problem manifests, the APD timer expires and the datastore appears as 'mounted' but 'inaccessible'. All other datastores on the affected host are okay. All other hosts that access the datastore in question are okay. Which datastore or host is affected is random. The datastore does not become available again until the ESX host is rebooted.
My issue was determined to be EMC bug number 850730.
More info on the troubleshooting process at: