DOakman
Contributor

NFS All Paths down with 5.5 U2

All,

We are seeing the symptoms described in the KB article VMware KB: Intermittent NFS APDs on VMware ESXi 5.5 U1 while running ESXi 5.5 U2.

The VMware host logs are consistent with the KB article.
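For anyone who wants to compare notes, this is roughly how we pull the relevant entries out of the host logs (just grep on the live vmkernel log; the pattern is only an example, adjust it to whatever your build actually logs):

grep -iE "APD|all paths down|nfs" /var/log/vmkernel.log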

The resolution for this is to apply Express Patch 4, which cannot be installed when running 5.5 U2.

VMware has indicated that the PR is still open, so the fix is likely not included in U2.

We have a P1 case open with VMware on this, but support has been slow to respond with any meaningful direction.

The connectivity path is clean and has plenty of available capacity, so we are confident it is not the network path.
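For what it's worth, the check we ran to rule out the path was basically vmkping from each host to the filer over the NFS vmkernel interface (the vmk interface and filer IP below are placeholders for our environment):

vmkping -I vmk1 -c 20 192.168.10.50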

Another KB article, VMware KB: NFS connectivity issues on NetApp NFS filers on ESXi 5.x, which is specific to NetApp filers, does not appear to apply: we are running Data ONTAP 8.2.2, which resolves the NetApp-specific issue, and we have SIOC enabled (one of the recommended workarounds for that issue). That article also refers back to the KB article above.

This issue is crippling the View VDI environment, so I wanted to reach out and see whether anyone has seen similar behavior or has recommendations on which direction to move.

We are considering dropping back to ESXi 5.5 U1 and applying the patch, or moving to either iSCSI or FC to eliminate the NFS-related issues. VMware also said to think about disabling jumbo frames (I'm not a fan of this, as the suggestion didn't seem to be based on any real data).
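If we do chase the jumbo-frame theory, the plan would be to validate the MTU end to end before turning anything off, with something like this (again, vmk interface and filer IP are placeholders; 8972 is the largest ICMP payload that still fits a 9000-byte MTU once the IP and ICMP headers are added):

vmkping -I vmk1 -d -s 8972 192.168.10.50

The -d flag sets don't-fragment, so the ping fails outright if anything in the path isn't actually passing jumbo frames.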

Thanks in advance

DaveO

6 Replies
RobertGroenewal
Contributor

Hi Dave,

I had a similar problem with the latest build of VMware and my NFS storage. From the logs I got the APD issue, and the storage was in a connect/reconnect loop. The interesting thing is that I'm not using NetApp, and my vendor and I are pretty sure it's not the network or storage side.

I had a Sev 1 case open, but the issue was "solved" by reboots and cleaning up the inventory all over again.

I'm very interested in what your outcome will be!

Regards,

Robert

Frankenheinz
Contributor

Hi All,

Right now we're experiencing the same problems.

Storage and network are not the problem.

Storage I/O Control seemed to be part of the problem, so we disabled it.

The hangs last about 10 seconds.

During a hang we can see with "esxcli network ip connection list" that vmm0:<vmname> just sits there and does nothing.

After 5 seconds, the NFS IO continues.
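In case anyone wants to reproduce the observation: during a hang we just loop the connection list from the ESXi shell and watch the NFS sessions (port 2049):

while true; do date; esxcli network ip connection list | grep 2049; sleep 1; done

That is how we see vmm0:<vmname> sitting idle until the I/O resumes.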

I think there's a severe bug in the NFS stack.

Changing NFS parameters and TSO didn't change anything either.
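To be clear about what we mean by NFS parameters and TSO, the changes were along these lines (MaxQueueDepth and hardware TSO shown only as examples, and the values are just illustrations, not recommendations):

esxcli system settings advanced list -o /NFS/MaxQueueDepth
esxcli system settings advanced set -o /NFS/MaxQueueDepth -i 64
esxcli system settings advanced set -o /Net/UseHwTSO -i 0

None of it changed the behavior for us.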

Right now we have a Sev 2 call.

Will keep you up2date.

Rgds.

Frank

RobertGroenewal
Contributor

Hi Frank,

Yes, it looks like an issue with the NFS stack. My storage vendor told me they are recreating the issue right now (together with VMware).

It looks sort of the same as the old NFS bug. The Storage I/O Control thing kept doing weird stuff on my cluster as well (although I don't have a license to use it anyway).

If I receive any updates, I'll inform you guys right away.

Robert

Frankenheinz
Contributor

Hi Robert,

We disabled Storage I/O Control completely.

With it enabled, we saw that the hangs lasted longer (up to about 15 s).

Right now it's about 5-10s.

What I forgot to mention: the default TCP congestion control is NewReno.

Perhaps it's a problem with the congestion control and the embedded kernel.

Nevertheless, we should keep updating the thread for other users who have the same issues.

Rgds.

Frank

Frankenheinz
Contributor

Hi Robert,

We solved the problem on our own.

On our storage, the hostname resolver wasn't working; the default configuration was broken 😉

What we've done so far (rough commands are below):

1.) Added all storage and VM hosts to /etc/hosts, including the local hostname (FQDN and short name)

2.) Changed nsswitch.conf (files dns)

3.) Enabled the DNS client (it was not enabled)

4.) Disabled mDNS

5.) Disabled nscd

Debugged with snoop on UDP port 53
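In case it helps anyone else, roughly what we ran on the storage head is below. This assumes a Solaris-style OS (SMF service names, snoop); on another filer the service names will differ, and the /etc/hosts entry is just a placeholder.

echo "192.168.10.11  esx01.lab.local esx01" >> /etc/hosts    # one line per storage/VM host, plus the local hostname
# nsswitch.conf hosts line set to: files dns
svcadm enable svc:/network/dns/client:default            # DNS client was off
svcadm disable svc:/network/dns/multicast:default        # mDNS
svcadm disable svc:/system/name-service-cache:default    # nscd
snoop udp port 53                                        # watch DNS traffic while reproducing the hang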

Problem gone

Rgds.

Frank

RobertGroenewal
Contributor

Hi Frank,

Thanks for updating!

Unfortunately, in my case I'm not using DNS resolution for my NFS exports. But it's worth trying to recreate the scenario in my lab with and without DNS resolution.

At the time the issue occurred, my DNS server was having issues as well.

I'll check if I can recreate it.

Thanks!

Robert
