Solved: Re: ESXi 7.0 NFS v4.1 Not mounting

Intlo · ‎05-12-2020

I ran into something today that I think could be a bug.

Trying to mount an NFS 4.1 share with :

esxcli storage nfs41 add -H 10.10.10.1 -s /data/nfstest -v nfstest

results in:

Unable to complete Sysinfo operation. Please see the VMkernel log file for more details.: Timeout

Running tcpdump from the NFS server shows that ESXi issues an arp request, but never follows up with communication.

I can ping the NFS server from the host and furthermore:

[root@esx:~] nc -z 10.10.10.1 2049
Connection to 10.10.10.1 2049 port [tcp/nfs] succeeded!

vmkernel.log shows the following relevant lines:

2020-05-12T19:06:44.598Z cpu3:2099718 opID=997a8ef8)World: 12458: VC opID esxcli-0f-b668 maps to vmkernel opID 997a8ef8
2020-05-12T19:06:44.598Z cpu3:2099718 opID=997a8ef8)NFS41: NFS41_VSIMountSet:403: Mount server: 10.10.10.1, port: 2049, path: /data/nfstest, label: nfstest, security: 1 user: , options: <none>
2020-05-12T19:06:44.598Z cpu3:2099718 opID=997a8ef8)StorageApdHandler: 976: APD Handle Created with lock[StorageApd-0x431d75a03540]
2020-05-12T19:06:44.598Z cpu3:2099718 opID=997a8ef8)NFS41: NFS41_ConnectionLookup:785: Created new connection for address tcp 10.10.10.1.8.1
2020-05-12T19:06:59.601Z cpu12:2107491)WARNING: SunRPC: 3936: fail all pending calls for client 0x430944c01370 IP 10.10.10.1.8.1 (socket disconnected)
2020-05-12T19:07:14.600Z cpu3:2099718 opID=997a8ef8)WARNING: NFS41: NFS41FSWaitForCluster:3702: Failed to wait for the cluster to be located: Timeout
2020-05-12T19:07:14.600Z cpu3:2099718 opID=997a8ef8)WARNING: NFS41: NFS41_FSMount:4814: NFS41FSDoMount failed: Timeout
2020-05-12T19:07:14.600Z cpu6:2098147)NFS41: NFS41_ConnectionRemove:1071: Connection: 0x431d75a042f0 [55] addr: tcp 10.10.10.1.8.1
2020-05-12T19:07:14.600Z cpu3:2099718 opID=997a8ef8)StorageApdHandler: 1062: Freeing APD handle 0x431d75a03540 []
2020-05-12T19:07:14.600Z cpu3:2099718 opID=997a8ef8)StorageApdHandler: 1146: APD Handle freed!
2020-05-12T19:07:14.600Z cpu3:2099718 opID=997a8ef8)WARNING: NFS41: NFS41_VSIMountSet:411: NFS41_FSMount failed: Timeout
2020-05-12T19:07:14.601Z cpu6:2098147)SunRPC: 1104: Destroying world 0x202863

Note the IP 10.10.10.1.8.1. Where did it get .8.1 from?? Between seeing the additional numbers on the address and not seeing communication (besides a single arp request) via tcpdump on the nfs server, it looks to me like NFS might be buggy on ESXi 7. Unless I'm missing something, but I have had good luck with NFS on previous versions of ESXi.

Updated before posting
I attempted using esxcli storage nfs add -H 10.10.10.1 -s /data/nfstest -v nfstest with a result of:

2020-05-12T19:14:50.383Z cpu12:2099721 opID=58765395)World: 12458: VC opID esxcli-4a-b6a0 maps to vmkernel opID 58765395
2020-05-12T19:14:50.383Z cpu12:2099721 opID=58765395)NFS: 162: Command: (mount) Server: (10.10.10.1) IP: (10.10.10.1) Path: (/data/nfstest) Label: (nfstest) Options: (None)
2020-05-12T19:14:50.383Z cpu12:2099721 opID=58765395)StorageApdHandler: 976: APD Handle 05893e1b-b8b24cee Created with lock[StorageApd-0x431d23c028b0]
2020-05-12T19:15:00.551Z cpu15:2099721 opID=58765395)SunRPC: 3306: Synchronous RPC abort for client 0x430944c01370 IP 10.10.10.1.0.111 proc 3 xid 0x5066cf32 attempt 1 of 3
2020-05-12T19:15:11.551Z cpu15:2099721 opID=58765395)SunRPC: 3306: Synchronous RPC abort for client 0x430944c01370 IP 10.10.10.1.0.111 proc 3 xid 0x5066cf32 attempt 2 of 3
2020-05-12T19:15:22.551Z cpu15:2099721 opID=58765395)SunRPC: 3306: Synchronous RPC abort for client 0x430944c01370 IP 10.10.10.1.0.111 proc 3 xid 0x5066cf32 attempt 3 of 3
2020-05-12T19:15:22.551Z cpu15:2099721 opID=58765395)SunRPC: 1104: Destroying world 0x202877
2020-05-12T19:15:22.551Z cpu15:2099721 opID=58765395)StorageApdHandler: 1062: Freeing APD handle 0x431d23c028b0 [05893e1b-b8b24cee]
2020-05-12T19:15:22.551Z cpu15:2099721 opID=58765395)StorageApdHandler: 1146: APD Handle freed!
2020-05-12T19:15:22.551Z cpu15:2099721 opID=58765395)NFS: 173: NFS mount 10.10.10.1:/data/nfstest failed: Unable to connect to NFS server.

In this one it shows IP 10.10.10.1.0.111. NFSv3 is enabled on the server side. Running tcpdump on the NFS server shows a similar lack of layer 3 communication after the arp requests.

NFS seems to load successfully on boot and my network interfaces are on the HCL.

2020-05-12T04:44:12.366Z cpu11:2097688)Activating Jumpstart plugin nfs.
2020-05-12T04:44:12.366Z cpu12:2097716)Activating Jumpstart plugin nfs41.
2020-05-12T04:44:12.448Z cpu23:2097649)Loading module nfsclient ...
2020-05-12T04:44:12.471Z cpu23:2097649)Elf: 2048: module nfsclient has license VMware
2020-05-12T04:44:12.474Z cpu23:2097649)FSS: 1153: Registered fs nfs, module 4e, fsTypeNum 0xb00f
2020-05-12T04:44:12.474Z cpu23:2097649)VProbe: 802: Loaded 3 static probes from: nfsclient
2020-05-12T04:44:12.474Z cpu23:2097649)Mod: 4997: Initialization of nfsclient succeeded with module ID 78.
2020-05-12T04:44:12.474Z cpu23:2097649)nfsclient loaded successfully.
2020-05-12T04:44:12.497Z cpu11:2097688)Jumpstart plugin nfs activated.
2020-05-12T04:44:12.516Z cpu23:2097649)Loading module nfs41client ...
2020-05-12T04:44:12.541Z cpu23:2097649)Elf: 2048: module nfs41client has license VMware
2020-05-12T04:44:12.549Z cpu23:2097649)VProbe: 802: Loaded 5 static probes from: nfs41client
2020-05-12T04:44:12.549Z cpu23:2097649)Mod: 4997: Initialization of nfs41client succeeded with module ID 79.
2020-05-12T04:44:12.549Z cpu23:2097649)nfs41client loaded successfully.
2020-05-12T04:44:12.573Z cpu12:2097716)Jumpstart plugin nfs41 activated.
2020-05-12T04:44:25.332Z cpu3:2097682)Activating Jumpstart plugin restore-nfs-volumes.
2020-05-12T04:44:25.351Z cpu3:2097682)Jumpstart plugin restore-nfs-volumes activated.

Intlo · ‎05-12-2020

Update, and Solution

Since other clients have used NFS successfully on the network I didn't think twice about the switch being an issue. For my lab I am using an Aruba/HP 2530-24G Switch (J9776A). While looking at the switch configuration I came across an interesting line towards the top:

(Switch)# show config
...
encrypt-credentials
dos-filter
no cdp run
aruba-central disable
...

According to HP:

Denial-of-service (DoS) attack filtering: automatically filters and drops common DoS attack traffic types

Well this seemed interesting so I ran :

(Switch)# show dos-filter
Denial of Service packet filtering: Enabled

Maybe it's seeing NFS between the NFS server and esxi host as DoS traffic? I'll disable it and try to mount the NFS datastore again

(Switch)# no dos-filter

:smileygrin: It worked!!! The esxi host was succesfully able to mount the NFS share.

I have no idea why the switch seemed to allow other NFS connections, and it doesn't appear to be configurable (it just 'automatically' filters things it doesn't like). I am also still curious about the IP 10.10.10.1.8.1 and IP 10.10.10.1.0.111 in vmkernel logs on the host, but for now I can sleep happily knowing that the two are talking.

View solution in original post

Intlo · ‎05-12-2020

Update, and Solution

Since other clients have used NFS successfully on the network I didn't think twice about the switch being an issue. For my lab I am using an Aruba/HP 2530-24G Switch (J9776A). While looking at the switch configuration I came across an interesting line towards the top:

(Switch)# show config
...
encrypt-credentials
dos-filter
no cdp run
aruba-central disable
...

According to HP:

Denial-of-service (DoS) attack filtering: automatically filters and drops common DoS attack traffic types

Well this seemed interesting so I ran :

(Switch)# show dos-filter
Denial of Service packet filtering: Enabled

Maybe it's seeing NFS between the NFS server and esxi host as DoS traffic? I'll disable it and try to mount the NFS datastore again

(Switch)# no dos-filter

:smileygrin: It worked!!! The esxi host was succesfully able to mount the NFS share.

I have no idea why the switch seemed to allow other NFS connections, and it doesn't appear to be configurable (it just 'automatically' filters things it doesn't like). I am also still curious about the IP 10.10.10.1.8.1 and IP 10.10.10.1.0.111 in vmkernel logs on the host, but for now I can sleep happily knowing that the two are talking.

ClemClem76 · ‎01-03-2022

Thanks for the tips on the DoS filter !

For the record it was the DoS option "SYN sPort less 1024" on my switch which was blocking NFS mount

michaeldaly · ‎10-22-2023

hello, I've also encountered the .8.1 anomaly associated with a NFS 4.1 mount failure (192.168.1.18) after rebooting of the host
vmkernel.log shows:

2023-10-22T10:54:46.059Z In(182) vmkernel: cpu4:524534)NFS41: NFS41_ConnectionLookup:804: Created new connection for address tcp 192.168.1.18.8.1
2023-10-22T10:54:46.059Z In(182) vmkernel: cpu0:524819)CpuSched: 873: user latency of 524883 RPC-tx-192.168.1.18.8.1 0 changed by 524819 NFS41SMConnQueue-0 -6
2023-10-22T10:54:46.060Z Wa(180) vmkwarning: cpu0:524819)WARNING: NFS41: NFS41ProcessConnConnect:3623: RPC connection failure for address tcp 192.168.1.18.8.1: Failure
2023-10-22T10:55:16.062Z In(182) vmkernel: cpu0:524819)NFS41: NFS41_ConnectionRemove:1089: Connection: 0x430d63804120 [0] addr: tcp 192.168.1.18.8.1

I'll obviously check the switch but I presume a workaround for the moment would be to add the following lines to cron..(?):

[root@DESKTOP-B5T7:~] esxcfg-nas -d MyNAS
NAS volume MyNAS deleted.
[root@DESKTOP-B5T7:~] esxcfg-nas -a -o 192.168.1.18 -s /data/vmware1 -v 4.1 MyNAS
Connecting to NAS volume: MyNAS
MyNAS created and connected.

All

ESXi 7.0 NFS v4.1 Not mounting