Hi,
we're experiencing a strange issue on hosts that have been upgraded to 5.1 U3 with VUM.
During boot it takes ~30s for each NFS-datastore to be mounted:
2015-07-03T13:30:14.878Z cpu16:16886)NFS: 157: Command: (remount) Server: (x.x.x.x) IP: (x.x.x.x) Path: (/XXX) Label: (XXX) Options: (None)
2015-07-03T13:30:14.878Z cpu16:16886)StorageApdHandler: 692: APD Handle 914c0a59-d8871221 Created with lock[StorageApd0x410062]
2015-07-03T13:30:44.966Z cpu32:16886)NFS: 218: NFS mount 172.16.16.50:/XXX status: Success
2015-07-03T13:30:44.984Z cpu32:16886)NFS: 157: Command: (remount) Server: (x.x.x.x) IP: (x.x.x.x) Path: (/YYY) Label: (YYY) Options: (None)
2015-07-03T13:30:44.984Z cpu32:16886)StorageApdHandler: 692: APD Handle 558dfa35-a2954b0f Created with lock[StorageApd0x410062]
2015-07-03T13:31:15.024Z cpu32:16886)NFS: 218: NFS mount 172.16.16.50:/YYY status: Success
This is an example from our staging environment with only two datastores, which causes only a one-minute boot delay. In production it leads to well over 30 minutes of delay, which we can't tolerate.
Staging and production use different server models and storage arrays but show the same problem. Only hosts updated to U3 are affected; all other hosts boot normally.
An SR with VMware is already open (for far too long now) but there's still no resolution in sight.
Any help is appreciated.
Regards
Hi,
A few quick questions:
1) What storage device is serving out NFS?
2) Has anything else changed other than the VMware upgrade? (No networking changes, etc).
3) Can you give a bit more background in to how your hosts are connected via the network to your storage device?
Hi,
just a few quick answers
1) Netapp FAS6210 and FAS6280
2) No, all other hosts are also running fine
3) Each host is using 2x 10Gb adapters (via LACP) for all traffic, including storage
Kind regards
Thanks for that!
Are you able to provide the full vmkernel and vobd logs from the affected host that you took the snippet below from?
Also, do you know if you have an NFS MaxQueueDepth setting on your hosts?
To check the NFS.MaxQueueDepth advanced parameter using the vSphere Client: select the host, go to the Configuration tab > Software > Advanced Settings, and look under NFS for NFS.MaxQueueDepth.
What is the value set to (if at all)?
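If CLI access is easier, the same parameter can be read or set with esxcli (a sketch; the value 64 is just the example commonly given in the VMware KB, not a recommendation for your setup):

```shell
# Read the current NFS.MaxQueueDepth value (run in the ESXi Shell / via SSH)
esxcli system settings advanced list -o /NFS/MaxQueueDepth

# Set it, e.g. to 64 as in the KB example; per the KB, a host reboot
# is needed for the change to take effect
esxcli system settings advanced set -o /NFS/MaxQueueDepth -i 64
```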
Hi,
I was facing a similar kind of issue earlier, but in my scenario I had multiple RDMs attached to the VMs in a cluster configuration. To overcome the problem I applied the settings mentioned in the article below, after which my ESXi hosts started coming up in 5-6 minutes instead of the 20 minutes they previously took.
Have a look, it might be a similar problem to the one you have mentioned.
Hi,
A shot in the dark...but is DNS working correctly on both the ESXi side as well as on the storage side?
Can the host resolve the IP addresses of the storage to names and vice versa?
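For example, from the ESXi Shell (the filer hostname below is just a placeholder for your own):

```shell
# Reverse lookup: storage IP -> name
nslookup 172.16.16.50

# Forward lookup: storage name -> IP
nslookup filer01.example.com
```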
Bgrds,
Finnur
Hi,
So Finnzi has gone down one route that I would have asked about. I can see, however, that the NFS mount commands in your vmkernel log already reference an IP address. It is after the StorageApdHandler creates the lock that the mount succeeds around 30 seconds later.
I keep seeing this article come up in my travels, which is why I asked about the MaxQueueDepth parameter. It is interesting that this is how it is set in Production but not on your development servers, where you are also seeing the issue. Is there any chance you can set this parameter on a host and then reboot it to take effect? It is a fairly straightforward change which could possibly help.
You mentioned your network is 2 * 10Gb configured via LACP on the switch side. Can you comment more about the VMware networking side of things? How is your virtual networking layer configured? vSwitch? dvSwitch? 1000V? Can you comment on the teaming, load balancing and failover settings in there? (I'm just trying to get a bigger picture of how things are configured.)
Hi,
correct, we're using IP addresses to connect to the storage, so there should be no name-resolution issues.
In production we've set MaxQueueDepth according to the mentioned KB article, and because VMware recommended it to us in a Service Request.
Setting/testing the parameter in the test environment should be no problem, but we see the issue in both environments.
In both environments the hosts are connected to a Cisco 1000v dvSwitch. Unfortunately I don't have further information about the configuration at this point.
One more difference that might be of interest:
Production is using jumbo frames end to end (ESXi <-> storage)
The test environment seems not to be properly configured, as I'm unable to ping the storage array with jumbo frames. Yesterday I experimented with different MTU settings, but it made no difference.
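For reference, the usual end-to-end jumbo frame test from the host is a vmkping with the don't-fragment bit set (the vmk interface name below is a placeholder; the payload size is 9000 minus 28 bytes of IP/ICMP headers):

```shell
# -d: set don't-fragment; -s 8972: payload sized for a 9000-byte MTU
# -I vmk1: send out of the storage VMkernel interface (adjust to yours)
vmkping -d -s 8972 -I vmk1 172.16.16.50
```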
Kind regards
Hi,
If you are seeing the same issue in both environments then that is a real pain.
I was interested in the networking side because there are a few things you should have configured if you are using NFS with LACP enabled on the switch (active/active uplinks, load balancing on IP hash, etc.), so I was hoping we could check that side of things.
Ideally your test setup should match your production environment, including jumbo frames; that is the recommended configuration.
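On a standard vSwitch that can be checked quickly from the host; on the 1000v the equivalent settings live on the VSM instead (vSwitch0 below is just an example name):

```shell
# Show the teaming/failover policy (load balancing, active/standby uplinks)
esxcli network vswitch standard policy failover get -v vSwitch0
```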
I'm slowly running out of ideas; I'm not a big NFS or NetApp person myself. Have you raised this with NetApp too? I've heard good things about their support!
Just as background: I was not involved in planning/building this test environment, but it's the only opportunity I have for testing...
I spoke to some people with more knowledge about the test environment, and now I know:
- LACP is not used here on the switches, nor any other port channel
- Neither is NetApp used (it's just a small QNAP :/)
To summarize:
We're seeing the slow NFS mounts on three hosts, independent of:
- Server Model
- Network Architecture
- Storage Vendor/Model
Now it's getting interesting...
I've just created a new standard vSwitch and moved the storage VMkernel port onto it. Boot/NFS mount times are now back to normal...
To me it looks like an issue with the Cisco VEM then, as this is now out of play.
That is really interesting!!
When you upgraded the hosts via VUM, did you upgrade the 1000v at all? Were there / are there any updates available for the 1000v?
If you SSH onto your host, what version of the Cisco VEM are you running?
esxcli software vib list | grep cisco
vem version
Also, what does the 1000v appliance report as the product version? According to this KB you should be on at least cross_cisco-vem-v144-4.2.1.1.5.2.0-3.1.2
I just checked my (albeit 5.5 host) and I'm running 4.2.1.2.2.1a.0-3.2.1 which is correct for me.
We didn't install any VEM updates.
~ # vem version
Running esx version -2323236 x86_64
VEM Version: 4.2.1.2.2.3.0-3.1.1
VSM Version: 4.2(1)SV2(2.3)
System Version: VMware ESXi 5.1.0 Releasebuild-2323236
Looks like you are on the right version (if not higher!).
I'd definitely point out your findings to VMware (or Cisco) though; it helps narrow things down for sure. I'm not sure what else I can do to help you now, you need their support to really dig into the issue. I'd be looking at the 1000v troubleshooting documentation now for sure.
We use 1000v here but I'm no expert although I'm happy to try and help further if you need it!
I'm having a WebEx with VMware Support today and I will certainly bring that up. I think they can open a case with Cisco directly if necessary.
Thanks so far for your efforts 😃
I'll keep this updated as I get new information or a resolution.
Hi ,
I wanted to know if you ever found out the resolution for this ?
Regards,
Anupam
It took ages, but in a second attempt to analyse/fix the issue with VMware and Cisco we were able to resolve it in our staging environment.
Cisco told us to do this:
--------------------------
I have done a brief search and found that this could be related to the VEM programming, which takes some time. Hence I would suggest you use system VLANs for accessing your NFS shares
For example :
On each veth port-profile that is used by a vmkernel port (i.e. one used for accessing the NFS share), please include the command
#system vlan x
Also check the Uplinks that are being used;
On the Nexus 1000v Uplink port-profile include the same command or
#system vlan add x
Once the config has been changed, please try to reboot the host again. Try it a couple of times, as the first attempt may not be successful while the VEM is still being programmed
--------------------------
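For anyone finding this later, a minimal sketch of what that looks like on the VSM (the port-profile names and VLAN 100 are placeholders for your own):

```
! vethernet port-profile used by the NFS vmkernel port
port-profile type vethernet NFS-VMK
  system vlan 100

! uplink port-profile carrying that VLAN
port-profile type ethernet SYSTEM-UPLINK
  system vlan add 100
```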
It instantly helped on the next reboot. Final verification in our production environment has not been performed yet.
Regards