VMware Cloud Community
Phil84
Contributor

Need help with NFS troubleshooting

Hello,

I'll try to explain the issue in English as best I can.

I migrated an old 5.5 cluster under vCenter Essentials Plus to a new 6.7 vCenter Essentials Plus cluster with 3 HPE ProLiant ESXi hosts and 2 Synology NAS units as shared NFS storage.

The old setup was a VMware cluster of 3 ESXi 5.5u2 hosts (vCenter Essentials Plus) on HPE ProLiant Gen8 servers with 6 HPE gigabit NICs, connected to 2 Synology NAS units, both configured for NFS v4.1 without the Synology NFS plugin installed.

With that old setup I never had any NFS trouble, but it was not very powerful; maybe the CPU power of the ESXi hosts and the NAS units was simply not enough to stress the storage.

The new setup is a VMware cluster of 3 ESXi 6.7.0u3 hosts (build 15160138, vCenter Essentials Plus) on HPE ProLiant Gen10 servers with 6 HPE gigabit NICs, connected to 2 Synology NAS units (one existing old NAS plus a new one of the same model), both configured for NFSv3 with the Synology NFS plugin installed on all ESXi hosts, which is supposed to improve storage performance.

The migration of all the VMs was terrible, because I hit severe issues losing all the shared NFS datastores on both Synology NAS units whenever I copied, cloned, or migrated data.

If the NFS bandwidth usage is too high, ESXi loses the NFS datastores, and the only way to get them back is to force a hard reboot of the entire HPE ProLiant server.

I contacted Synology and HPE for technical support, but they replied that they had never seen this issue before.

HPE told me to update all the firmware; I did, but nothing improved.

Synology support took a look at the debug file and confirmed an error:

2020-03-17T10:02:43+01:00 syno_vm mountd[10955]: refused mount request from 192.168.141.33 for /volume1/datastore (/volume1/datastore): illegal port 16237

They gave me a tip: change the NFS shared folder permissions to allow NFS connections from ports above 1024. I applied it, and the illegal-port error notifications disappeared from the log,

but I had a new crash just after that.
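For reference, I believe the Synology option maps to the standard NFS `insecure` export option, which allows client requests from source ports at or above 1024 (like port 16237 in the log above). A sketch of what the resulting export line looks like on a generic Linux NFS server (the path and other options are illustrative; Synology generates this file from its UI):

```
# /etc/exports on the NAS (illustrative; exact options may differ on Synology)
# "insecure" permits NFS requests from non-privileged source ports (>= 1024)
/volume1/datastore 192.168.141.0/24(rw,async,insecure,no_root_squash)
```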

HPE took a deeper look at the VMware logs and found something.

Here is part of an ESXi log:

2020-03-17T16:49:47.365Z cpu5:2097603)StorageApdHandlerEv: 110: Device or filesystem with identifier [cbac65d1-3f9fe9c5] has entered the All Paths Down state.

2020-03-17T16:51:11.365Z cpu0:2098644)WARNING: NFS: 337: Lost connection to the server 192.168.141.12 mount point /volume1/datastore, mounted as cbac65d1-3f9fe9c5-0000-000000000000 ("Syno_backup")

2020-03-17T16:51:11.365Z cpu0:2098644)WARNING: NFS: 337: Lost connection to the server 192.168.141.6 mount point /volume1/datastore, mounted as 25a6dc8f-63bfb2da-0000-000000000000 ("Syno")

2020-03-17T16:51:11.366Z: [vmfsCorrelator] 841914822913us: [vob.vmfs.nfs.server.disconnect] Lost connection to the server 192.168.141.12 mount point /volume1/datastore, mounted as cbac65d1-3f9fe9c5-0000-000000000000 ("Syno_backup")

2020-03-17T16:51:11.366Z: [vmfsCorrelator] 841916240720us: [esx.problem.vmfs.nfs.server.disconnect] 192.168.141.12 /volume1/datastore cbac65d1-3f9fe9c5-0000-000000000000 Syno_backup

2020-03-17T16:51:11.366Z: [vmfsCorrelator] 841914823025us: [vob.vmfs.nfs.server.disconnect] Lost connection to the server 192.168.141.6 mount point /volume1/datastore, mounted as 25a6dc8f-63bfb2da-0000-000000000000 ("Syno")

They pointed me to a VMware KB article: https://kb.vmware.com/s/article/2016122

I applied the command from the KB on all ESXi hosts: esxcfg-advcfg -s 64 /NFS/MaxQueueDepth
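For reference, here is how the value can be checked and set on each host over SSH; the esxcli form should be equivalent to esxcfg-advcfg (a sketch, to be run on each ESXi host):

```shell
# Read the current NFS queue depth
esxcfg-advcfg -g /NFS/MaxQueueDepth

# Equivalent esxcli syntax: set it to 64 per the KB, then verify
esxcli system settings advanced set -o /NFS/MaxQueueDepth -i 64
esxcli system settings advanced list -o /NFS/MaxQueueDepth
```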

But it doesn't fix the issue; it's just a workaround that did the job for a while. Whenever the system needs heavy storage bandwidth, such as during a backup or restore, it crashes again.

There is no firewall between the servers and the storage, and no active firewall on the Windows vCenter server.

Each ESXi host has 3 dedicated NICs for the storage vmkernel, and each NAS has 3 dedicated NICs for NFS traffic, all devices at the default MTU of 1500.

If I set MTU 9000 on all devices, it gets worse: I can no longer open the NFS datastores from ESXi at all.

I use the same settings on another VMware installation with 3 HPE ProLiant Gen10 servers running the 6.5u2 HPE custom image and 2 Synology NAS units as NFSv3 shared storage, but without the Synology NFS plugin. It's a great, powerful cluster without any trouble.

I also use the same settings on old VMware 5.0 and 4.0 installations, because I had too much trouble with poor iSCSI connections to Synology NAS units.

Synology's developers reviewed the VMware KB articles, and in their expert judgment this issue is more on the ESXi side than on the NAS side, based on the following:

1. The workaround, lowering the MaxQueueDepth of the NFS connection, reduces the density of I/O requests the ESXi client sends, which means the variable factor is on the ESXi side.

2. In the NAS kernel log they saw no hung-task call traces, only the illegal-port error messages, which suggests the problem is more related to the ESXi side.

3. The NFS plugin is only designed to improve copy/clone speed.

For them, the differences are not only the NFS plugin but also the ESXi version; they cannot firmly judge that the issue derives from the NFS plugin, but based on the above, they think the problem more likely comes from VMware ESXi.

I asked them how to uninstall the plugin if needed, and I'm waiting for their reply.

Thanks for your help if anybody has already had the same bad experience.

1 Solution

Accepted Solutions
Phil84
Contributor

After another crash last week (so it wasn't the Synology plugin), I decided to make some new changes:

I modified the ESXi network settings to use the HPE ProLiant 369i quad-port integrated NIC for VMware NFS storage instead of the Intel E1G42ET PCIe dual-port NIC I had to install manually (I usually never have to do that, as all the drivers are normally already included, so this could be the issue here).

This way the NFS traffic uses up-to-date drivers, unlike the old VMware 5.5 drivers (no update available) used by the Intel E1G42ET NICs, which are now dedicated to vMotion traffic only.

I also reset the cluster settings to apply all the modifications on each ESXi host.

Following these changes I noticed a general improvement, but I will wait for at least a week of production before confirming the issue is resolved.

A week without a crash would already be a big step forward!

7 Replies
scott28tt
VMware Employee

If you have a support contract with VMware, now would be the time to use it...


-------------------------------------------------------------------------------------------------------------------------------------------------------------

Although I am a VMware employee I contribute to VMware Communities voluntarily (i.e. not in any official capacity)
VMware Training & Certification blog
Phil84
Contributor

I know, but it's an HPE OEM license, so there is no VMware support for it.

That's why I'm posting this for the community.

I've opened a new ticket with HPE support, but if anyone has already had this issue, any input would be a great help.

Phil84
Contributor

I'll read the whitepaper you sent me, thank you.

I also have a lot of hosts using NFS without trouble, but none of them run VMware 6.7, which could be the reason.

NetApp also has a KB describing the same trouble: https://kb.netapp.com/app/answers/answer_view/a_id/1029822

So it's not a question of which brand of NAS I use.

Maybe I need to run some more tests using 'vmkping -I vmkX xx.xx.xx.xx' and 'vmkping -d -s 1472 -I vmkX xx.xx.xx.xx' as described in that article.
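As I understand it, the -d flag sets don't-fragment and the payload size is the MTU minus 28 bytes of IP and ICMP headers, so the packet only gets through if the whole path really carries that MTU. Here is what I plan to run (vmk1 and the target address are placeholders for my storage vmkernel port and a NAS IP):

```shell
# Basic reachability from the storage vmkernel port
vmkping -I vmk1 192.168.141.12

# Don't-fragment test at standard MTU: 1472 = 1500 - 20 (IP) - 8 (ICMP)
vmkping -d -s 1472 -I vmk1 192.168.141.12

# If I ever retry jumbo frames, this must succeed end to end first:
# 8972 = 9000 - 28
vmkping -d -s 8972 -I vmk1 192.168.141.12
```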

Phil84
Contributor

I'll migrate all the NFS v3 datastores to v4.1 ASAP, following the how-to provided here:

https://docs.vmware.com/en/VMware-vSphere/6.7/com.vmware.vsphere.storage.doc/GUID-8A929FE4-1207-4CC5...

I already did this on other ESXi installations without trouble, so I think I can handle it.

I would also like to try uninstalling the Synology NFS plugin, but their support never answered my question: how do I remove the plugin I installed via SSH?

Is this the correct command ?

esxcli --server=example_name_server software vib remove --vibname=synology-nfs-vaai-plugin-1.2-1008.vib
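From what I can tell of the esxcli syntax, `software vib remove` expects the VIB name rather than the .vib file name, and the --server option is only needed when running esxcli remotely. So I would first list the installed VIBs to confirm the exact name (a sketch; the grep pattern and the VIB name below are my guesses):

```shell
# Find the exact installed VIB name first
esxcli software vib list | grep -i syno

# Remove by VIB name (no .vib extension, no version suffix)
esxcli software vib remove -n synology-nfs-vaai-plugin

# A host reboot is typically required after removing a VAAI plugin
```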

And I'll reconnect and restart an old ESXi 5.5u2 host against the NAS to get a better diagnosis and compare with another device and release.

Stay tuned

Phil84
Contributor

Today I uninstalled the Synology NFS plugin from one ESXi host and ran a recovery test job to simulate heavy NFS storage I/O.

And it worked just fine! I recovered a 250 GB VM at around 45 MB/s without any failure.

Then I tried resetting NFS MaxQueueDepth to its 128 default value on the same ESXi host and ran a new recovery test job, but it failed.

But I didn't lose the ESXi connection to the NFS storage like before.

So I think the Synology NFS plugin is the cause of this issue, but maybe there is something else.

I've opened a ticket with the Nakivo backup software support to see if they have an idea of what's happening here.

Work in progress!
