Expert

Hosts losing access to NFS share

Hey guys,

I have a weird problem and it is like looking for a needle in a haystack.

I have two vSphere 5 hosts connected to an IBM N series filer running Data ONTAP 8.0.2P3.

We are using this environment to test VMware View.

Sometimes a host loses connectivity to some of my 3 volumes. The volume shows up as inactive.
The other host does not appear to lose access to the same volumes at the same time.

During all this I can still ping the filer or access a volume on the same filer from the same host.

The weird thing is it is not necessarily the same host or the same volume that acts up each time.
I logged a job with VMware and they basically said it is an issue with the filer or networking.

I cannot find anything on the network side. The connection does not go down. Remember some of the volumes are still accessible.

I logged a job with IBM but they don't even bother getting back to me.
I decided to add a different filer to the mix. This time a NetApp 2040.

At this stage I have two different filers connected to my two hosts.

The NetApp volumes show the same behaviour: a host loses connectivity to some volumes but not all, and not necessarily the same host or the same volume each time.
Now that I have two filers in the mix I lose connectivity to random volumes on both filers. And they don't necessarily have a VM on them either.
They seem to flap a lot. Some of them come back after a few minutes. Looks like a game of ping pong.

During this time there are no issues with the hosts' networking. There is no disconnect from vCenter, and the only VMs affected are those that happen to sit on an affected volume.

Everything appeared to have been stable for a few days, but when I deployed VMs with the rapid cloning utility this morning it all went pear-shaped.
I have also witnessed this behaviour when deploying with VMware View Connector or just migrating VMs to the filer's datastore.

This rules out the filers in my opinion.
I also don't believe it is the switch as nothing shows up in the logs.

To me it seems host related and potentially vSphere 5 specific.

Our production environment is configured the same without any issues, but that is running vSphere 4.

Any ideas because I don't have a clue right now 🙂

Please consider marking my answer as "helpful" or "correct"
17 Replies

Contributor

Hi,

You'll probably hate this 'me too' email because I'm not adding much to the discussion.

Basically, we have a similar setup to you in that we are using a NetApp (3240 / ONTAP 7.3.7) to serve NFS.

The servers we are using are different though in that they are HP BL460c G7s.

We are experiencing exactly the same problem in that randomly a host will lose connection to all of its NFS datastores, obviously leaving the guests high and dry with no disks.

Did you ever find a solution?  I had this logged with VMware, who weren't able to find the problem - just hinted that it must be a network problem because nothing significant appeared in the VM logs.

Help appreciated.

Richard

Enthusiast

Hi AllBlack,

Have you configured the advanced NFS configuration parameters for the ESXi hosts attached to the storage?

If not then these need to be set:

Net.TcpipHeapSize = 32
Net.TcpipHeapMax = 128
NFS.MaxVolumes = 256
NFS.HeartbeatMaxFailures = 10
NFS.HeartbeatFrequency = 12
NFS.HeartbeatTimeout = 5
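For reference, on ESXi 5.x these advanced values can also be applied from the host CLI with esxcli - a sketch, to be run per host (the heap values take effect after a reboot):

```shell
# Sketch: apply the NFS/TCP advanced settings above via esxcli (ESXi 5.x)
esxcli system settings advanced set -o /Net/TcpipHeapSize -i 32
esxcli system settings advanced set -o /Net/TcpipHeapMax -i 128
esxcli system settings advanced set -o /NFS/MaxVolumes -i 256
esxcli system settings advanced set -o /NFS/HeartbeatMaxFailures -i 10
esxcli system settings advanced set -o /NFS/HeartbeatFrequency -i 12
esxcli system settings advanced set -o /NFS/HeartbeatTimeout -i 5
```

You can verify a value afterwards with `esxcli system settings advanced list -o /NFS/MaxVolumes`.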


Let me know how you get on, or if you need help in getting them configured.

Cheers

Contributor

These numbers look really interesting and we comply with almost none of them :smileyshocked:

By any chance, do you have any links or references to support them?

Much appreciated,

Richard

Enthusiast

Hi Frank,

I believe I obtained these from a NetApp KB some time ago.

Let me see if I can dig it out.

Cheers

Commander

Hi,

Here are a couple of KBs from VMware that talk about the NFS advanced settings configuration.

http://kb.vmware.com/kb/2239

http://kb.vmware.com/kb/1007909

http://kb.vmware.com/kb/1012062

Here is a link to a VMware KB you can use to initiate troubleshooting of NFS issues: http://kb.vmware.com/kb/1003967

Regards

Mohammed

Mohammed Emaad |VCP 3, 4,5 |VCP -NV 6 | VCP-DT 51 | vCAP4-DCA | VCAP5DCA | | Mark it as helpful or correct if my suggestion is useful.
Enthusiast

Thanks memaad

I was just about to post the links up for Frank

Contributor

Lots of reading for the Christmas break!

By any chance, do you have the NetApp KB too?

Many thanks,

Richard

Enthusiast

Hi Frank,

It wasn't actually a NetApp KB; it was the ones that have been posted already.

I was also advised by our third party to configure the settings.

We have had them in place and have experienced no issues.

Have a good Christmas

Virtuoso

Hi..

Depending on your setup you might also want to check flow control settings on connected pSwitches and host pNICs. According to NetApp, flow control should now be disabled on modern network gear...

From: http://media.netapp.com/documents/tr-3749.pdf

"For modern network equipment, especially 10GbE equipment, NetApp recommends turning off flow control and allowing congestion management to be performed higher in the network stack. For older equipment, typically GbE with smaller buffers and weaker buffer management, NetApp recommends configuring the endpoints, ESX servers, and NetApp arrays with the flow control set to "send."
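On the ESX host side, the current flow control setting of a pNIC can be checked and changed with ethtool from the ESXi shell - a sketch, where vmnic0 is a placeholder NIC name:

```shell
# Show current pause-frame (flow control) settings for a pNIC (vmnic0 is an example)
ethtool -a vmnic0
# Sketch: set transmit-only flow control ("send"), per the TR-3749 guidance for older GbE gear
ethtool -A vmnic0 autoneg off rx off tx on
```

Note the switch ports and the filer interfaces need matching settings for flow control to behave as intended.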

/Rubeck

Virtuoso

Rubeck wrote:

Depending on your setup you might also want to check Flow Control settings on connected pSwitches and host pNICS.. According to NetApp flow control should now be disabled on modern network gear...

From: http://media.netapp.com/documents/tr-3749.pdf

"For modern network equipment, especially 10GbE equipment, NetApp recommends turning off flow control and allowing congestion management to be performed higher in the network stack. For older equipment, typically GbE with smaller buffers and weaker buffer management, NetApp recommends configuring the endpoints, ESX servers, and NetApp arrays with the flow control set to "send."

/Rubeck

fyi - We've been reviewing this flow control topic the past few weeks for our NetApp 3170s and have found that the setting is not supported on CNA cards (only works with plain 10Gb NICs without the extra FCoE chip).  We'll be installing some new cards [on the NetApp heads] and will test in January.  We currently have flow control up and running on a newly deployed 3240 (connected to Nexus 5k's) and it seems to be working.

Virtuoso

Frank White wrote:

These numbers look really interesting and we comply with almost none of them:smileyshocked:


(hehe... too funny!)

Best way to apply those NFS best practice settings for ESXi hosts is using the NetApp VSC plugin (i.e. Home > NetApp from the vSphere Client).  If vCenter is not ready then PowerCLI does the trick nicely too.  Host reboot required / desired for settings to take effect.

Contributor

AllBlack,

Any chance you saw any storage side errors like these when the disconnects happen?

[nfsd.tcp.close.idle.notify:warning]: Shutting down idle connection to client (xxx.xxx.xxx.xxx) where transmit side flow control has been enabled. There are 22 outstanding replies queued on the transmit buffer.

Contributor

The recommendation from VMware centred around our networking configuration, namely making use of jumbo frames, VLANing and trunking.  In addition, there are countless references to flow control causing problems, especially related to NFS.  I'll post my findings when I have something.

I wonder if someone could help with some sanity checking of our networking design.  This design incorporates ESXi 5.1, NetApp 3240 filers using NFS, and HP BL460c G7 blades using quad and dual port mezzanine cards to give a total of 8 NICs.

One vSwitch on VMware will be configured with 2 NICs for the sole purpose of presenting NFS to the hosts.

From what I understand:

*             ESXi 5.1 cannot do LACP (only possible with Enterprise Plus licenses and a distributed switch).

*             The NetApp can trunk using LACP, Multimode VIFs or Single mode.

The switch we are using is a single 5406zl chassis with four 24 port modules.  I'm sure you've noticed that we're rather exposed in the event of a chassis failure, but this is a risk we are prepared to bear.  It does have the advantage of making trunking easier though, as everything is going through one switch.

Now, my question is:

1. Do we configure the NICs on the VM host side as a standard trunk (using 'route based on IP')?

2. Should we configure these switch ports as a standard trunk (not LACP, for the reasons given above)?

3. Do we configure the vif on the NetApp side as LACP or a multimode trunk (bearing in mind the filer can do LACP but the VM hosts can't)?

4. Assuming 3 is yes, we would presumably configure the trunks on the switch as LACP too.

Question 3 is probably the important one here.
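For context on question 3, a dynamic multimode (LACP) vif on a 7-mode filer is created along these lines - a sketch from memory, where vif0, e0a and e0b are placeholder names (check TR-3749 for the exact procedure):

```shell
# Sketch: create a dynamic multimode (LACP) vif on Data ONTAP 7-mode,
# load-balancing on source/destination IP (matches 'route based on IP' host-side)
vif create lacp vif0 -b ip e0a e0b
```

The `-b ip` balancing choice is what lines up with the host-side 'route based on IP hash' teaming policy.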

Help appreciated - you've already been most helpful.

Richard

Contributor

You'll find all this is outlined in TR-3749 http://www.netapp.com/us/media/tr-3749.pdf if you have not seen it.  It explains how to configure each side based on the capabilities of your switches, but it sounds like you are on the right track.  I would definitely love to hear what you discover with the flow control.  My customers use 10G, so the new recommendation is that it be turned off for "newer" equipment.

Expert

Sorry for the belated reply.

This turned out to be an issue with the HP NIC we were using. Nothing to do with settings, as we comply with NetApp best practices.

The HP NC522/523 had this issue and it is actually well known by now. I don't think HP still supplies them.
We used a different NIC and have had no problems whatsoever since.

Please consider marking my answer as "helpful" or "correct"
Enthusiast

Old post, but I had the same problem on ESXi 5.

In my case the solution was to change the MTU on the vmkernel NIC. I found the solution via this post:

vmware - lost connection to nfs datastore — mecdata.it
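For reference, on ESXi 5 the vmkernel interface MTU can be checked and changed with esxcli - a sketch, where vmk1 and 9000 are placeholder values; the MTU has to match end-to-end on the vSwitch, physical switch and filer:

```shell
# List vmkernel interfaces and their current MTU
esxcli network ip interface list
# Sketch: set the NFS vmkernel interface (vmk1 as an example) to jumbo frames
esxcli network ip interface set -i vmk1 -m 9000
```

A mismatched MTU somewhere in the path is a classic cause of NFS datastores flapping while plain pings still succeed.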

Maybe it can help someone.
