Hi,
we're experiencing a strange issue on hosts that have been upgraded to 5.1 U3 with VUM.
During boot it takes ~30s for each NFS-datastore to be mounted:
2015-07-03T13:30:14.878Z cpu16:16886)NFS: 157: Command: (remount) Server: (x.x.x.x) IP: (x.x.x.x) Path: (/XXX) Label: (XXX) Options: (None)
2015-07-03T13:30:14.878Z cpu16:16886)StorageApdHandler: 692: APD Handle 914c0a59-d8871221 Created with lock[StorageApd0x410062]
2015-07-03T13:30:44.966Z cpu32:16886)NFS: 218: NFS mount 172.16.16.50:/XXX status: Success
2015-07-03T13:30:44.984Z cpu32:16886)NFS: 157: Command: (remount) Server: (x.x.x.x) IP: (x.x.x.x) Path: (/YYY) Label: (YYY) Options: (None)
2015-07-03T13:30:44.984Z cpu32:16886)StorageApdHandler: 692: APD Handle 558dfa35-a2954b0f Created with lock[StorageApd0x410062]
2015-07-03T13:31:15.024Z cpu32:16886)NFS: 218: NFS mount 172.16.16.50:/YYY status: Success
This is an example from our staging environment with only two datastores, which causes only a one-minute boot delay. In production it leads to well over 30 minutes of delay, which we can't tolerate.
Staging and production use different server models and storage arrays but show the same problem. Only hosts updated to U3 are affected; all other hosts boot normally.
An SR with VMware is already open (for far too long now) but there's still no resolution in sight.
Any help is appreciated.
Regards
Hi,
A few quick questions:
1) What storage device is serving out NFS?
2) Has anything else changed other than the VMware upgrade? (No networking changes, etc).
3) Can you give a bit more background in to how your hosts are connected via the network to your storage device?
Hi,
just a few quick answers
1) Netapp FAS6210 and FAS6280
2) No, all other hosts are also running fine
3) Each host is using 2x 10Gb adapters (via LACP) for all traffic, including storage
Kind regards
Thanks for that!
Are you able to provide the full vmkernel and vobd logs from the affected host that you took the snippet below from?
Also, do you know if you have an NFS MaxQueueDepth setting on your hosts?
To check the NFS.MaxQueueDepth advanced parameter using the vSphere Client: select the host, go to the Configuration tab > Software > Advanced Settings, and look under NFS for NFS.MaxQueueDepth.
What is the value set to (if at all)?
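If CLI access is easier, the same parameter can be read or set with esxcli (a sketch; the value 64 is just the example commonly given in the VMware KB, not a recommendation for your setup):

```shell
# Read the current NFS.MaxQueueDepth value (run in the ESXi Shell / via SSH)
esxcli system settings advanced list -o /NFS/MaxQueueDepth

# Set it, e.g. to 64 as in the KB example; per the KB, a host reboot
# is needed for the change to take effect
esxcli system settings advanced set -o /NFS/MaxQueueDepth -i 64
```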
Hi,
I was facing a similar kind of issue earlier, but in my scenario I had multiple RDMs attached to the VMs in a cluster configuration. To overcome the problem I applied the settings mentioned in the article below, after which my ESXi hosts started coming up in 5-6 minutes instead of the 20 minutes they previously took.
Have a look, it might be a similar problem to the one you have mentioned.
Hi,
A shot in the dark...but is DNS working correctly on both the ESXi side as well as on the storage side?
Can the host resolve the IP addresses of the storage to names and vice versa?
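For example, from the ESXi Shell (the filer hostname below is just a placeholder for your own):

```shell
# Reverse lookup: storage IP -> name
nslookup 172.16.16.50

# Forward lookup: storage name -> IP
nslookup filer01.example.com
```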
Bgrds,
Finnur
Hi,
So Finnzi has gone down one route that I would have asked about. I can see, however, that the NFS mount commands in your vmkernel log already reference an IP address. It is after the StorageApdHandler creates the lock that the mount succeeds around 30 seconds later.
I keep seeing this article come up in my travels, which is why I asked about the MaxQueueDepth parameter. It is interesting that this is how it is set in Production but not on your development servers, where you are also seeing the issue. Is there any chance you can set this parameter on a host and then reboot it to take effect? It is a fairly straightforward change which could possibly help.
You mentioned your network is 2 * 10Gb configured via LACP on the switch side. Can you comment more about the VMware networking side of things? How is your virtual networking layer configured? vSwitch? dvSwitch? 1000V? Can you comment on the teaming, load balancing and failover settings in there? (I'm just trying to get a bigger picture of how things are configured.)
Hi,
correct, we're using IP addresses to connect to the storage, so there should be no name-resolution issues.
In production we've set MaxQueueDepth according to the mentioned KB article, and because VMware recommended it to us in a Service Request.
Setting/testing the parameter in the test environment should be no problem, but we see the issue in both environments.
In both environments the hosts are connected to a Cisco 1000v dvSwitch. Unfortunately I don't have further information about the configuration at this point.
One more difference that might be of interest:
Production is using jumbo frames end to end (ESXi <-> storage)
The test environment seems not to be properly configured, as I'm unable to ping the storage array with jumbo frames. Yesterday I experimented with different MTU settings, but it made no difference.
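For reference, the usual end-to-end jumbo frame test from the host is a vmkping with the don't-fragment bit set (the vmk interface name below is a placeholder; the payload size is 9000 minus 28 bytes of IP/ICMP headers):

```shell
# -d: set don't-fragment; -s 8972: payload sized for a 9000-byte MTU
# -I vmk1: send out of the storage VMkernel interface (adjust to yours)
vmkping -d -s 8972 -I vmk1 172.16.16.50
```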
Kind regards
Hi,
If you are seeing the same issue in both environments then that is a real pain.
I was interested in the networking side because there are a few things you should have configured if you are using NFS with LACP enabled on the switch (active/active uplinks, load balancing on IP hash, etc.), so I was hoping we could check that side of things.
Ideally your test setup should match your production environment, including jumbo frames; that is the recommended configuration.
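On a standard vSwitch that can be checked quickly from the host; on the 1000v the equivalent settings live on the VSM instead (vSwitch0 below is just an example name):

```shell
# Show the teaming/failover policy (load balancing, active/standby uplinks)
esxcli network vswitch standard policy failover get -v vSwitch0
```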
I'm slowly running out of ideas; I'm not a big NFS or NetApp person myself. Have you raised this with NetApp too? I've heard good things about their support!
Just as background: I was not involved in planning/building this test environment, but it's the only opportunity I have for testing...
I spoke to some people with more knowledge about the test environment, and now I know:
- LACP is not used here on the switches, nor any other port channel
- Neither is NetApp used (it's just a small QNAP :/)
To summarize:
We're seeing the slow NFS mounts on three hosts, independent of:
- Server Model
- Network Architecture
- Storage Vendor/Model
Now it's getting interesting...
I've just created a new standard vSwitch and moved the storage VMkernel port onto it. Boot/NFS mount times are now back to normal...
To me it looks like an issue with the Cisco VEM then, as this is now out of play.
That is really interesting!!
When you upgraded the hosts via VUM, did you upgrade the 1000v at all? Were there / are there any updates available for the 1000v?
If you SSH onto your host, what version of the Cisco VEM are you running?
esxcli software vib list | grep cisco
vem version
Also, what does the 1000v appliance report as the product version? According to this KB you should be on at least cross_cisco-vem-v144-4.2.1.1.5.2.0-3.1.2
I just checked my (albeit 5.5 host) and I'm running 4.2.1.2.2.1a.0-3.2.1 which is correct for me.
We didn't install any VEM updates.
~ # vem version
Running esx version -2323236 x86_64
VEM Version: 4.2.1.2.2.3.0-3.1.1
VSM Version: 4.2(1)SV2(2.3)
System Version: VMware ESXi 5.1.0 Releasebuild-2323236
Looks like you are on the right version (if not higher!).
I'd definitely point out your findings to VMware (or Cisco) though; it helps narrow things down for sure. I'm not sure what else I can do to help you now, you need their support to really dig into the issue. I'd be looking at the 1000v troubleshooting documentation now for sure.
We use 1000v here but I'm no expert although I'm happy to try and help further if you need it!
I'm having a WebEx with VMware Support today and I will certainly bring that up. I think they can open a case with Cisco directly if necessary.
Thanks so far for your efforts 😃
I'll keep this updated as I get new information or a resolution.
Hi ,
I wanted to know if you ever found out the resolution for this ?
Regards,
Anupam
It took ages, but in a second attempt to analyse/fix the issue with VMware and Cisco we were able to resolve it in our staging environment.
Cisco told us to do this:
--------------------------
I have done a brief search and found that this could be related to the VEM programming, which takes some time. Hence I would suggest you use system VLANs for accessing your NFS shares
For example :
On each veth port-profile that is used by a vmkernel port (i.e. one used for accessing the NFS share), please include the command
#system vlan x
Also check the Uplinks that are being used;
On the Nexus 1000v Uplink port-profile include the same command or
#system vlan add x
Once the config has been changed, please try to reboot the host again. Try it a couple of times, as the first attempt may not be successful while the VEM is still being programmed
--------------------------
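For anyone finding this later, a minimal sketch of what that looks like on the VSM (the port-profile names and VLAN 100 are placeholders for your own):

```
! vethernet port-profile used by the NFS vmkernel port
port-profile type vethernet NFS-VMK
  system vlan 100

! uplink port-profile carrying that VLAN
port-profile type ethernet SYSTEM-UPLINK
  system vlan add 100
```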
It instantly helped on the next reboot. Final verification in our production environment has not been performed yet.
Regards