VMware Cloud Community
Matt_B1
Enthusiast

NFS disconnect on all blades but no network or storage issues

We had about 50 of 1000 VMs stop responding for about 15 minutes this morning.  The affected VMs are on different ESX 4.0.0 build 261974 servers, but all on 1 of 2 NetApp clustered controllers serving NFS.  All 4 volumes on "MyFiler" disconnected.  The filers and the blades running ESX connect through the same 2 L2 switches.  The filer did not see any network disconnects.  On the switches, the ports for the filer saw traffic drop to near 0 KBps for the 15 minutes.  The NFS volumes are several TBs each and hold well over 50 VMs apiece, so I don't understand why all of the VMs were not affected.  We have not implemented the NetApp-recommended guest disk timeout of 190s and left it at the default put in place by VMware Tools.  Any ideas on what caused this?

I have edited/commented some of the logs to remove the unique info...

/var/log/messages [same logs appear on other ESX hosts in the same cluster]
Sep 27 04:01:03 server1 vobd: Sep 27 04:01:03.352: 34164107014900us: [vprob.vmfs.nfs.server.disconnect] ...
Sep 27 04:01:15 server1 vobd: Sep 27 04:01:15.358: 34164119021021us: [vprob.vmfs.nfs.server.disconnect] ...
Sep 27 04:01:15 server1 vobd: Sep 27 04:01:15.488: 34164119150841us: [vprob.vmfs.nfs.server.disconnect] ...
Sep 27 04:01:15 server1 vobd: Sep 27 04:01:15.489: 34164119151386us: [vprob.vmfs.nfs.server.disconnect] ...

/var/log/vmkwarning [same logs appear on other ESX hosts in the same cluster]
Sep 27 04:00:12 server1 vmkernel: 395:10:00:53.724 cpu16:127832)WARNING: VSCSI: 3116: handle 11526(vscsi0:1):WaitForCIF: Issuing reset;  number of CIF:5
Sep 27 04:00:12 server1 vmkernel: 395:10:00:53.965 cpu6:110441)WARNING: VSCSI: 3116: handle 11506(vscsi0:1):WaitForCIF: Issuing reset;  number of CIF:1
Sep 27 04:01:03 server1 vmkernel: 395:10:01:44.772 cpu2:4120)WARNING: NFS: 277: Lost connection to server <IP> mount point ...
Sep 27 04:01:15 server1 vmkernel: 395:10:01:56.778 cpu6:4120)WARNING: NFS: 277: Lost connection to server <IP> mount point ...
Sep 27 04:01:15 server1 vmkernel: 395:10:01:56.778 cpu6:4120)WARNING: NFS: 277: Lost connection to server <IP> mount point ...
Sep 27 04:01:15 server1 vmkernel: 395:10:01:56.778 cpu6:4120)WARNING: NFS: 277: Lost connection to server <IP> mount point ...

MyFiler logs - this is the filer hosting the VMs.  Some of the errors are most likely not related
Tue Sep 27 02:55:19 PDT [MyFiler: sis.changelog.full:warning]: Change logging metafile on volume <VOL> is full and can not hold any more fingerprint entries.

Tue Sep 27 04:11:56 PDT [MyFiler: nfsd.tcp.close.idle.notify:warning]: Shutting down idle connection to client (<IP of ESX17>) where receive side flow control has been enabled. There are 72104 bytes in the receive buffer. This socket is being closed from the deferred queue.

Tue Sep 27 04:19:03 PDT [MyFiler: rlm.orftp.failed:warning]: RLM communication error, receiver timeout waiting for RLM response.
Tue Sep 27 04:19:03 PDT [MyFiler: rlm.driver.hourly.stats:warning]: The software driver for the Remote LAN Module (RLM) detected a problem: Configuration Error (
1).
Tue Sep 27 04:19:15 PDT [MyFiler: replication.src.err:error]: SnapVault: source transfer from ... transfer aborted because of network error.
Tue Sep 27 04:19:27 PDT [MyFiler: replication.src.err:error]: SnapVault: source transfer from ... transfer attempted from busy source.

8 Replies
prasadsv
Enthusiast

Are all the VMs of the same flavor, i.e., Windows or Linux?  If the affected VMs are Windows, then you need to change the disk timeout.
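For Windows guests that usually means the Disk TimeOutValue registry entry (the value is in seconds, and a guest reboot is the safe way to make sure it takes effect).  For example, to set the 190s value NetApp recommends for NFS datastores:

reg add "HKLM\SYSTEM\CurrentControlSet\Services\Disk" /v TimeOutValue /t REG_DWORD /d 190 /f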

Can you please upload the vmkernel logs for the time frame in which the issue was observed?

EdWilts
Expert

Matt wrote:

We had about 50 of 1000 VMs stop responding for about 15 minutes this morning.  The affected VMs are on different ESX 4.0.0 build 261974 servers, but all on 1 of 2 NetApp clustered controllers serving NFS.  All 4 volumes on "MyFiler" disconnected.  The filers and the blades running ESX connect through the same 2 L2 switches.  The filer did not see any network disconnects.  On the switches, the ports for the filer saw traffic drop to near 0 KBps for the 15 minutes.  The NFS volumes are several TBs each and hold well over 50 VMs apiece, so I don't understand why all of the VMs were not affected.  We have not implemented the NetApp-recommended guest disk timeout of 190s and left it at the default put in place by VMware Tools.  Any ideas on what caused this?

[....]

Tue Sep 27 04:11:56 PDT [MyFiler: nfsd.tcp.close.idle.notify:warning]: Shutting down idle connection to client (<IP of ESX17>) where receive side flow control has been enabled. There are 72104 bytes in the receive buffer. This socket is being closed from the deferred queue.

This message in your filer logs appears to be the key.  NetApp is highly confused as to whether or not flow control needs to be enabled or disabled.  If you're able to get a definitive answer out of them, please share it.

NetApp has conflicting recommendations, and escalations have gotten us nowhere.  We continue to see this issue on only one of our filers, even where flow control is disabled on both the filer and the switch ports it's connected to.  Unlike settings like speed and duplex, what you configure on the filer end is merely a suggestion, not an override - the filer first negotiates with the switch and the switch wins, no matter what your rc file has in it.
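For anyone comparing against their own setup, disabling it end to end typically looks something like this (interface names are just examples, and as noted above the switch side ultimately decides what actually gets negotiated):

On the filer (Data ONTAP 7-mode), with the same line added to /etc/rc so it persists:
ifconfig e0a flowcontrol none

On a Cisco switch port facing the filer:
interface GigabitEthernet1/0/1
 flowcontrol receive off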

This does NOT appear to be an ESX issue.  We have standalone blades in an HP C7000 enclosure with storage on the same filer that suffer the same way as the ESXi blades - the IP address listed in the filer's messages file hits all of them too, sometimes at the same time and sometimes not.  We last saw it a few days ago with a roughly 2-minute pause.  The blades in that environment are ESXi 4.1.0 build 433742, but we also saw it on earlier releases.

We're escalating yet again to NetApp...

.../Ed (VCP4, VCP5)
Matt_B1
Enthusiast

The VMs run different flavors of Windows and Linux.  Here are the vmkernel logs; the first block of entries repeated several times:

Sep 27 04:00:09 server1 vmkernel: 395:10:00:51.337 cpu3:105293)VSCSI: 2135: handle 11479(vscsi0:0):Reset request on FSS handle 366253058 (4 outstanding commands)
Sep 27 04:00:09 server1 vmkernel: 395:10:00:51.337 cpu10:4221)VSCSI: 2395: handle 11479(vscsi0:0):Reset [Retries: 0/0]
Sep 27 04:00:09 server1 vmkernel: 395:10:00:51.337 cpu10:4221)VSCSI: 2207: handle 11479(vscsi0:0):Completing reset (0 outstanding commands)
Sep 27 04:00:12 server1 vmkernel: 395:10:00:53.724 cpu16:127832)WARNING: VSCSI: 3116: handle 11526(vscsi0:1):WaitForCIF: Issuing reset;  number of CIF:5
Sep 27 04:00:12 server1 vmkernel: 395:10:00:53.724 cpu16:127832)VSCSI: 2135: handle 11526(vscsi0:1):Reset request on FSS handle 713806858 (5 outstanding commands)
Sep 27 04:00:12 server1 vmkernel: 395:10:00:53.724 cpu10:4221)VSCSI: 2395: handle 11526(vscsi0:1):Reset [Retries: 0/0]
Sep 27 04:00:12 server1 vmkernel: 395:10:00:53.724 cpu10:4221)VSCSI: 2207: handle 11526(vscsi0:1):Completing reset (0 outstanding commands)
Sep 27 04:00:12 server1 vmkernel: 395:10:00:53.965 cpu6:110441)WARNING: VSCSI: 3116: handle 11506(vscsi0:1):WaitForCIF: Issuing reset;  number of CIF:1
Sep 27 04:00:12 server1 vmkernel: 395:10:00:53.965 cpu6:110441)VSCSI: 2135: handle 11506(vscsi0:1):Reset request on FSS handle 68654672 (1 outstanding commands)

Sep 27 04:01:03 server1 vmkernel: 395:10:01:44.772 cpu2:4120)WARNING: NFS: 277: Lost connection to server <IP> mount point ...
Sep 27 04:01:15 server1 vmkernel: 395:10:01:56.778 cpu6:4120)WARNING: NFS: 277: Lost connection to server <IP> mount point ...
Sep 27 04:01:15 server1 vmkernel: 395:10:01:56.778 cpu6:4120)WARNING: NFS: 277: Lost connection to server <IP> mount point ...
Sep 27 04:01:15 server1 vmkernel: 395:10:01:56.778 cpu6:4120)WARNING: NFS: 277: Lost connection to server <IP> mount point ...
Sep 27 04:01:17 server1 vmkernel: 395:10:01:58.616 cpu16:4126)FSS: 666: Failed to get object b00f 36 0 40 8ca0a197 20 40000000 8ca0a197 67ec24d 40 a0a197 0 0 0 :No connection
Sep 27 04:01:17 server1 vmkernel: 395:10:01:58.616 cpu19:4125)FSS: 666: Failed to get object b00f 36 0 40 8ca0a197 20 40000000 8ca0a197 67ec24d 40 a0a197 0 0 0 :No connection

opper
Contributor

We experienced the same issue, and changing the Net and NFS advanced settings to the NetApp best-practice values cleared it.  Check your advanced settings for the following:

Net.TcpipHeapSize = 30

Net.TcpipHeapMax = 120

NFS.MaxVolumes = 64

NFS.HeartbeatMaxFailures = 10

NFS.HeartbeatFrequency = 12

NFS.HeartbeatTimeout = 5

Keep in mind you can make these changes manually (changing MaxVolumes requires a host reboot), or you can install the NetApp vCenter plugin and it will make them for you.
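If you would rather script it than click through the vSphere Client, something along these lines from the ESX service console should do it (a rough sketch using the same values as above; run it on each host, and MaxVolumes still needs the reboot):

esxcfg-advcfg -s 30 /Net/TcpipHeapSize
esxcfg-advcfg -s 120 /Net/TcpipHeapMax
esxcfg-advcfg -s 64 /NFS/MaxVolumes
esxcfg-advcfg -s 10 /NFS/HeartbeatMaxFailures
esxcfg-advcfg -s 12 /NFS/HeartbeatFrequency
esxcfg-advcfg -s 5 /NFS/HeartbeatTimeout

esxcfg-advcfg -g <option> reads a value back if you want to verify what is currently set.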

Josh

Matt_B1
Enthusiast

I have had the NetApp-recommended settings in place on every host since the start of the cluster.  In the end, it appears the controller was overloaded with operations and had to pause.  They highlighted a higher-than-normal read ratio, pointing to unaligned disks.  I am not completely convinced, since only a small percentage of our VMs are unaligned.
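For anyone who wants to spot-check alignment themselves, one rough way from the ESX service console (NetApp's mbrscan/mbralign tools in the Host Utilities do the same job more thoroughly) is to look at the partition start sectors inside a VM's flat file; the path here is just an example:

fdisk -lu /vmfs/volumes/datastore1/myvm/myvm-flat.vmdk

A partition whose start sector is divisible by 8 (8 x 512 bytes = WAFL's 4 KB block) is aligned; the classic unaligned case is the old Windows default start sector of 63.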

jyarborough
Contributor

I'm dealing with a similar NFS disconnection issue with NetApp and thought I would share another discussion and link:

A discussion on NetApp's forum:

https://communities.netapp.com/thread/10396

A KB article from NetApp relating to nfsd.tcp.close.idle.notify:warning:

https://kb.netapp.com/support/index?page=content&id=2013194

Basically, they are saying this may just happen and recommend a takeover/giveback or controller reset.  Funny thing about our situation is that the KB article shows it was published April 9, 2012, which is the exact day we first noticed the problem on our FAS.  Anyway, thought it might help.
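For reference, on a 7-mode HA pair (assuming that is what you are running) the takeover/giveback they suggest is driven from the partner controller during a maintenance window, roughly:

cf status
cf takeover
(wait for the affected controller to come back up and reach the waiting-for-giveback state)
cf giveback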

AllBlack
Expert

Same or similar problem here.

I have two hosts running vSphere 5 connected to 3 volumes on the filer.  I lose connectivity to some of the volumes but see no other issues on the hosts.
The recommended settings are in place, etc.
This is actually a test environment, as I am investigating VMware View.

Funny thing is, our vSphere 4 production environment is connected to the same filer, but different volumes, and does not have any issues.

Looks like I need to open an SR with NetApp...

Please consider marking my answer as "helpful" or "correct"
EdWilts
Expert

jyarborough wrote:

I'm dealing with a similar NFS disconnection issue with NetApp and thought I would share another discussion and link:

A discussion on NetApp's forum:

https://communities.netapp.com/thread/10396

A KB article from NetApp relating to nfsd.tcp.close.idle.notify:warning:

https://kb.netapp.com/support/index?page=content&id=2013194

Basically, they are saying this may just happen and recommend a takeover/giveback or controller reset.  Funny thing about our situation is that the KB article shows it was published April 9, 2012, which is the exact day we first noticed the problem on our FAS.  Anyway, thought it might help.

More details are coming from NetApp this week on the nfsd.tcp.close.idle.notify issue.  There appears to be a non-public bug with the stack running out of buffer space.  A fix is being worked on.

If you have Oracle RAC on the same filer, you may be running into an issue there too.  Another non-public bug, this one from Oracle, has to do with a misbehaving direct NFS client.  A patch is available for it and we're rolling it out now.

There are NetApp settings you can change to help alleviate the bug, although they didn't fix it for us, and neither did the takeover/giveback.  We tripped over the issue again within a couple of days.  The settings we were given are:

1. Set the TCP receive window size from the default of 64K to 256K.
   a. "options nfs.tcp.recvwindowsize 262144"
   b. The source of this recommendation is NetApp TR-3557, "HP-UX NFS Performance with Oracle".
   c. Increasing the default 4x gives the protocol stack more room for the application.

2. Increase nfs.tcp.xfersize to put more data in flight, effectively reducing the handshaking at the protocol level.
   a. "options nfs.tcp.xfersize 65536"
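Running "options" with just the option name prints the current value, so you can confirm on each head what is actually in use before and after the change, e.g.:

options nfs.tcp.recvwindowsize
options nfs.tcp.xfersize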

.../Ed (VCP4, VCP5)