17 Replies Latest reply on Aug 5, 2020 3:36 AM by cicco171

    Hosts losing access to NFS share

    AllBlack Expert

      Hey guys,

       

      I have a weird problem and it is like looking for a needle in a haystack.

       

      I have two vSphere 5 hosts which are connected to an IBM N-series filer running Data ONTAP 8.0.2P3.

      We are using this environment to test VMware View.

       

      Sometimes a host loses connectivity to some of my 3 volumes. The volume shows up as inactive.
      The other host does not appear to lose access to the same volumes at the same time.

      During all this I can still ping the filer or access a volume on the same filer from the same host.

       

      The weird thing is that it is not necessarily the same host or the same volume that acts up.
      I logged a job with VMware and they basically said it is an issue with the filer or networking.

      I cannot find anything on the network side. The connection does not go down. Remember some of the volumes are still accessible.

       

      I logged a job with IBM but they don't even bother getting back to me.
      I decided to add a different filer to the mix, this time a NetApp 2040.

      At this stage I have two different filers connected to my two hosts.

       

      The NetApp volumes show the same behaviour: connectivity is lost to some volumes but not all, and not necessarily from the same host or to the same volume.
      Now that I have two filers in the mix I lose connectivity to random volumes on both filers, and they don't necessarily have a VM on them either.
      They seem to flap a lot; some of them come back after a few minutes. It looks like a game of ping pong.

       

      During this time there are no issues with the hosts' networking. There is no disconnect from vCenter, and only the VMs that happen to sit on an affected volume are impacted.
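
      For what it's worth, this is roughly what I plan to run from PowerCLI to capture exactly which host loses which NFS volume and when, so I can line it up with the deployment jobs. It is only a rough, untested sketch: it assumes an existing Connect-VIServer session, and the log file name is just a placeholder.

          # Rough sketch: poll every NFS datastore and log any host that reports it inaccessible.
          # Assumes Connect-VIServer has already been run; nfs-flaps.log is a placeholder path.
          while ($true) {
              foreach ($ds in Get-Datastore | Where-Object { $_.Type -eq "NFS" }) {
                  foreach ($mount in $ds.ExtensionData.Host) {
                      if (-not $mount.MountInfo.Accessible) {
                          # $mount.Key is the managed object reference of the host with the stale mount
                          $esx = Get-View -Id $mount.Key -Property Name
                          $msg = "{0}  {1} reports {2} inaccessible" -f (Get-Date), $esx.Name, $ds.Name
                          $msg
                          Add-Content -Path nfs-flaps.log -Value $msg
                      }
                  }
              }
              Start-Sleep -Seconds 30
          }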

       

      Everything appeared to be stable for a few days, but when I deployed VMs with the Rapid Cloning Utility this morning it all went pear-shaped.
      I have also witnessed this behaviour when deploying with the VMware View connector or just migrating VMs to the filer's datastore.

       

      This rules out the filers in my opinion.
      I also don't believe it is the switch, as nothing shows up in its logs.

       

      To me it seems host related and potentially vSphere 5 specific.

       

      Our production environment is configured the same way without any issues, but that is running vSphere 4.

       

       

      Any ideas because I don't have a clue right now :-)

        • 1. Re: Hosts losing access to NFS share
          richardjjuk Lurker

          Hi,

           

          You'll probably hate this 'me too' email because I'm not adding much to the discussion.

           Basically, we have a similar setup to you in that we are using a NetApp (3240 / ONTAP 7.3.7) to serve NFS.

          The servers we are using are different though in that they are HP BL460c G7s.

           

           We are experiencing exactly the same problem in that randomly a host will lose connection to all of its NFS datastores, obviously leaving the guests high and dry with no disks.

           

           Did you ever find a solution?  I had this logged with VMware, who weren't able to find the problem - they just hinted that it must be a network problem because nothing significant appeared in the VM logs.

           

          Help appreciated.

           

          Richard

          • 2. Re: Hosts losing access to NFS share
            Gooose Hot Shot

            Hi AllBlack,

             

            Have you configured the advanced NFS configuration parameters for the ESXi hosts attached to the storage?

             

            If not then these need to be set:

             

             Parameter = set to:

             Net.TcpipHeapSize = 32
             Net.TcpipHeapMax = 128
             NFS.MaxVolumes = 256
             NFS.HeartbeatMaxFailures = 10
             NFS.HeartbeatFrequency = 12
             NFS.HeartbeatTimeout = 5
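
             If it is easier, something along these lines from PowerCLI should set them on every host in one go. Treat it as a rough, untested sketch: it assumes an existing Connect-VIServer session, and older PowerCLI builds would use Set-VMHostAdvancedConfiguration instead of the Get-/Set-AdvancedSetting pair. A host reboot is still needed afterwards for the heap settings to take effect.

                 # Rough sketch: push the recommended NFS/TCP advanced settings to all hosts.
                 # Assumes Connect-VIServer has already been run.
                 $nfsSettings = @{
                     "Net.TcpipHeapSize"        = 32
                     "Net.TcpipHeapMax"         = 128
                     "NFS.MaxVolumes"           = 256
                     "NFS.HeartbeatMaxFailures" = 10
                     "NFS.HeartbeatFrequency"   = 12
                     "NFS.HeartbeatTimeout"     = 5
                 }

                 foreach ($esx in Get-VMHost) {
                     foreach ($name in $nfsSettings.Keys) {
                         Get-AdvancedSetting -Entity $esx -Name $name |
                             Set-AdvancedSetting -Value $nfsSettings[$name] -Confirm:$false
                     }
                 }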


            Let me know how you get on, or if you need help in getting them configured.

             

            Cheers

            • 3. Re: Hosts losing access to NFS share
              Frank White Lurker

               These numbers look really interesting, and we comply with almost none of them.

              By any chance, do you have any links or references to support them?

               

              Much appreciated,

              Richard

              • 4. Re: Hosts losing access to NFS share
                Gooose Hot Shot

                Hi Frank,

                 

                 I obtained these from a NetApp KB, I believe, some time ago.

                 

                Let me see if I can dig it out.

                 

                Cheers

                • 5. Re: Hosts losing access to NFS share
                  memaad Master

                  Hi,

                   

                   Here are a couple of KBs from VMware that talk about the NFS advanced settings configuration.

                   

                  http://kb.vmware.com/kb/2239

                   

                  http://kb.vmware.com/kb/1007909

                   

                  http://kb.vmware.com/kb/1012062

                   

                   Here is a link to a VMware KB you can use to initiate troubleshooting of NFS issues: http://kb.vmware.com/kb/1003967

                   

                  Regards

                  Mohammed

                  • 6. Re: Hosts losing access to NFS share
                    Gooose Hot Shot

                    Thanks memaad

                     

                     I was just about to post the links up for Frank.

                    • 7. Re: Hosts losing access to NFS share
                      Frank White Lurker

                       Lots of reading for the Christmas break!

                      By any chance, do you have the NetApp KB too?

                       

                      Many thanks,

                      Richard

                      • 8. Re: Hosts losing access to NFS share
                        Gooose Hot Shot

                        Hi Frank,

                         

                         It wasn't actually a NetApp KB; it was the ones that have been posted already.

                         

                        I was also advised by our third party to configure the settings.

                         

                         We have had them in place and have experienced no issues.

                         

                        Have a good Christmas

                        • 9. Re: Hosts losing access to NFS share
                          Rubeck Master

                          Hi..

                           

                           Depending on your setup you might also want to check flow control settings on the connected pSwitches and host pNICs. According to NetApp, flow control should now be disabled on modern network gear...

                           

                          From: http://media.netapp.com/documents/tr-3749.pdf

                           

                          "For modern network equipment, especially 10GbE
                          equipment, NetApp recommends turning off flow control and allowing congestion management to be
                          performed higher in the network stack. For older equipment, typically GbE with smaller buffers and
                          weaker buffer management, NetApp recommends configuring the endpoints, ESX servers, and NetApp
                          arrays with the flow control set to "send."

                          /Rubeck

                          • 10. Re: Hosts losing access to NFS share
                            grasshopper Virtuoso

                             Rubeck wrote:

                             Depending on your setup you might also want to check flow control settings on the connected pSwitches and host pNICs. According to NetApp, flow control should now be disabled on modern network gear...

                             

                            From: http://media.netapp.com/documents/tr-3749.pdf

                             

                            "For modern network equipment, especially 10GbE
                            equipment, NetApp recommends turning off flow control and allowing congestion management to be
                            performed higher in the network stack. For older equipment, typically GbE with smaller buffers and
                            weaker buffer management, NetApp recommends configuring the endpoints, ESX servers, and NetApp
                            arrays with the flow control set to "send."

                            /Rubeck

                             

                             FYI - we've been reviewing this flow control topic for the past few weeks for our NetApp 3170s and have found that the setting is not supported on CNA cards (it only works with plain 10Gb NICs without the extra FCoE chip).  We'll be installing some new cards [on the NetApp heads] and will test in January.  We currently have flow control up and running on a newly deployed 3240 (connected to Nexus 5Ks) and it seems to be working.

                            • 11. Re: Hosts losing access to NFS share
                              grasshopper Virtuoso

                              Frank White wrote:

                               

                              These numbers look really interesting and we comply with almost none of them


                              (hehe... too funny!)

                               

                               The best way to apply those NFS best-practice settings to ESXi hosts is the NetApp VSC plugin (i.e. Home > NetApp in the vSphere Client).  If vCenter is not ready yet, PowerCLI does the trick nicely too.  A host reboot is required / desired for the settings to take effect.

                              • 12. Re: Hosts losing access to NFS share
                                Cecil_M Lurker

                                AllBlack,

                                 

                                Any chance you saw any storage side errors like these when the disconnects happen?

                                 [nfsd.tcp.close.idle.notify:warning]: Shutting down idle connection to client (xxx.xxx.xxx.xxx) where transmit side flow control has been enabled. There are 22 outstanding replies queued on the transmit buffer.

                                • 13. Re: Hosts losing access to NFS share
                                  richardjjuk Lurker

                                   The recommendation from VMware centred around our networking configuration, namely making use of jumbo frames, VLANs and trunking.  In addition, there are countless references to flow control causing problems, especially with NFS.  I'll post my findings when I have something.

                                   

                                   I wonder if someone could help with some sanity checking of our networking design.  This design incorporates ESXi 5.1, NetApp 3240 filers using NFS, and HP BL460c G7 blades using quad and dual port mezzanine cards to give a total of 8 NICs.

                                   

                                   One vSwitch in VMware will be configured with 2 NICs solely for the presentation of NFS to the hosts.

                                   

                                  From what I understand:

                                   *  ESXi 5.1 cannot do LACP (only possible with Enterprise Plus licenses and a distributed switch).

                                   *  The NetApp can trunk using LACP, multimode VIFs or single mode.

                                   

                                   The switch we are using is a single 5406zl chassis with four 24-port modules.  I'm sure you've noticed that we're rather exposed in the event of a chassis failure, but this is a risk we are prepared to bear.  It does have the advantage of making trunking easier, though, as everything goes through one switch.

                                   

                                   Now, my questions are:

                                   1.  Do we configure the NICs on the VM host side as a standard trunk (using 'route based on IP hash')?  A rough sketch of what I mean follows this list.

                                   2.  Should we configure these switch ports as a standard trunk (not LACP, for the reason given above)?

                                   3.  Do we configure the vif on the NetApp side as LACP or a multimode trunk (bearing in mind the filer can do LACP but the VM hosts can't)?

                                   4.  Assuming the answer to 3 is LACP, we would presumably configure the trunks on the switch as LACP too.
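
                                   This is roughly what I had in mind for the host side of question 1, as an untested PowerCLI sketch (the vSwitch name 'vSwitch_NFS' is just a placeholder, and it assumes an existing Connect-VIServer session):

                                       # Rough sketch: set the NFS vSwitch on each host to 'Route based on IP hash',
                                       # which would need a matching static (non-LACP) port trunk on the 5406zl.
                                       foreach ($esx in Get-VMHost) {
                                           Get-VirtualSwitch -VMHost $esx -Name "vSwitch_NFS" |
                                               Get-NicTeamingPolicy |
                                               Set-NicTeamingPolicy -LoadBalancingPolicy LoadBalanceIP
                                       }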

                                   

                                   Question 3 is probably the important one here.

                                   

                                  Help appreciated - you've already been most helpful.

                                  Richard

