VMware Cloud Community
shechtl
Contributor

NFS store inactive after reboot (and never comes back)

We have a really strange problem with ESXi 3.5 (latest patch, but the problem also occurs with older 3.5 versions).

All our stores are connected over NFS to an open-e DSS storage.

Everything worked, but if I shut down the ESX host and the storage together for maintenance and then power both on, most or all stores are inactive, even once the storage is up. The ESX host boots faster than the storage, so the storage is not yet up when the ESX host is ready.

Rebooting the ESX host and/or the storage does not help.

So there is no way to start my VMs. Pinging the storage from the ESX host works, and vmkping works too.

I can also mount the NFS store from a separate Linux machine, so the configuration must be OK. I changed nothing; the shutdown was the only thing I did.
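A successful ping only shows that ICMP reaches the storage. The failing step in the vmkernel log is a portmapper lookup for the mount program (100005, version 3, TCP), so it can help to repeat that exact lookup from another machine. The following is a minimal sketch of such a check, assuming the storage runs a standard portmapper on TCP port 111; the function names are made up for illustration and the code only handles a single-fragment reply.

```python
import socket
import struct

# Standard ONC RPC portmapper constants (program 100000, version 2, GETPORT = 3).
PMAP_PROG, PMAP_VERS, PMAPPROC_GETPORT = 100000, 2, 3
# Mount program and version as named in the vmkernel warning.
MOUNT_PROG, MOUNT_VERS = 100005, 3
IPPROTO_TCP = 6


def build_getport_call(xid, prog=MOUNT_PROG, vers=MOUNT_VERS, prot=IPPROTO_TCP):
    """Build a record-marked RPC call asking the portmapper for a program's port."""
    body = struct.pack(
        ">10I",
        xid, 0, 2,                                # xid, message type CALL, RPC version 2
        PMAP_PROG, PMAP_VERS, PMAPPROC_GETPORT,
        0, 0,                                     # credentials: AUTH_NULL, length 0
        0, 0,                                     # verifier: AUTH_NULL, length 0
    ) + struct.pack(">4I", prog, vers, prot, 0)   # GETPORT arguments, port unused
    # Record marker: top bit flags the last fragment, low 31 bits give the length.
    return struct.pack(">I", 0x80000000 | len(body)) + body


def parse_getport_reply(payload, xid):
    """Return the port from a reply body (record marker already stripped)."""
    rxid, mtype, reply_stat, _vflavor, vlen = struct.unpack(">5I", payload[:20])
    if rxid != xid or mtype != 1 or reply_stat != 0:
        raise ValueError("not an accepted reply to this call")
    off = 20 + ((vlen + 3) & ~3)                  # skip verifier body, padded to 4 bytes
    (accept_stat,) = struct.unpack(">I", payload[off:off + 4])
    if accept_stat != 0:
        raise ValueError("portmapper returned accept_stat %d" % accept_stat)
    (port,) = struct.unpack(">I", payload[off + 4:off + 8])
    return port


def _recv_exact(sock, n):
    """Read exactly n bytes from sock or raise if the peer closes early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("connection closed early")
        buf += chunk
    return buf


def query_mount_port(host, timeout=5.0, xid=0x1234, pmap_port=111):
    """Ask the portmapper on host where the mount program listens (0 = not registered)."""
    with socket.create_connection((host, pmap_port), timeout=timeout) as sock:
        sock.sendall(build_getport_call(xid))
        (marker,) = struct.unpack(">I", _recv_exact(sock, 4))
        # Assumption: the whole reply arrives as one fragment.
        return parse_getport_reply(_recv_exact(sock, marker & 0x7FFFFFFF), xid)
```

Calling `query_mount_port("193.168.10.1")` from the Linux machine would show whether the storage answers the lookup at all. A timeout or a returned port of 0 would point at the storage side rather than at ESX.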

The only way I could get the stores running again is:

1. Delete all stores in ESXi.

2. Reboot the storage.

3. Reboot the ESXi host, but only once the storage is up.

4. Configure new stores in ESXi.
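The "wait until the storage is up" part of this procedure could be checked from a management machine before the ESXi host is powered on. Below is a minimal sketch, assuming the storage exposes the portmapper on TCP port 111 and NFS on TCP port 2049 (the usual defaults, not confirmed for open-e DSS); `port_is_open` and `wait_for_storage` are hypothetical helper names.

```python
import socket
import time


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_storage(host, ports=(111, 2049), deadline_s=600.0, interval_s=5.0):
    """Poll until every port in `ports` accepts connections or the deadline passes.

    The default ports are an assumption: 111 for the portmapper, 2049 for NFS.
    """
    end = time.monotonic() + deadline_s
    while True:
        if all(port_is_open(host, p) for p in ports):
            return True
        if time.monotonic() >= end:
            return False
        time.sleep(interval_s)
```

For example, `wait_for_storage("193.168.10.1")` returns `True` as soon as both ports accept connections, and `False` if that has not happened within ten minutes. An open port only shows that the service is listening, not that the export can actually be mounted.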

In the ESX log I could see these error messages:

vmkernel: 0:00:00:17.697 cpu0:1259)NFS: 107: Command: (remount) Server: (193.168.10.1) IP: (193.168.10.1) Path: (/share/backupdsa1) Label: (iscsi1backupdsa1) Options: (None)

vmkernel: 0:00:00:48.447 cpu0:1259)WARNING: NFS: 898: RPC error 13 (RPC was aborted due to timeout) trying to get port for Mount Program (100005) Version (3) Protocol (TCP) on Server (193.168.10.1)

vmkernel: 0:00:00:48.447 cpu3:1184)WARNING: NFS: 960: Connect failed for client 0x9213a08 sock 134351240: I/O error

vmkernel: 0:00:00:48.447 cpu3:1184)WARNING: NFS: 898: RPC error 12 (RPC failed) trying to get port for Mount Program (100005) Version (3) Protocol (TCP) on Server (193.168.10.1)
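When comparing logs like these across hosts, it can be easier to pull the fields out of each warning than to read the raw lines. The following is a small sketch that assumes the "RPC error ... trying to get port for ..." wording stays exactly as shown above; the regular expression and the names `parse_rpc_warning` and `summarize` are illustrative, not part of any VMware tool.

```python
import re
from collections import Counter

# Matches warnings of the form seen above:
# "WARNING: NFS: 898: RPC error 13 (...) trying to get port for Mount Program (100005) ..."
NFS_RPC_WARNING = re.compile(
    r"WARNING: NFS: \d+: RPC error (?P<code>\d+) \((?P<reason>[^)]*)\) "
    r"trying to get port for (?P<program>.+?) \((?P<prog_num>\d+)\) "
    r"Version \((?P<version>\d+)\) Protocol \((?P<protocol>\w+)\) "
    r"on Server \((?P<server>[^)]+)\)"
)


def parse_rpc_warning(line):
    """Return the fields of an NFS RPC warning as a dict, or None if the line differs."""
    match = NFS_RPC_WARNING.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("code", "prog_num", "version"):
        fields[key] = int(fields[key])
    return fields


def summarize(lines):
    """Count RPC warnings per (server, error code) pair, ignoring other lines."""
    counts = Counter()
    for line in lines:
        fields = parse_rpc_warning(line)
        if fields is not None:
            counts[(fields["server"], fields["code"])] += 1
    return counts
```

Feeding the four lines above into `summarize` would group the timeout (error 13) and the failure (error 12) under server 193.168.10.1, while the remount and "Connect failed" lines are skipped because they do not match the pattern.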

I found a lot of reports of RPC problems with ESX Server, but none with ESXi.

To me it looks as if ESX keeps some kind of footprint of the store, and if mounting the store fails while ESX is booting (because the storage is not up), it never mounts this store again.

Can someone confirm this with NFS?

If I have a working store and reboot only the storage, the store comes back. Likewise, if I boot only the ESXi host while the storage is up, it works.

I tried a lot of things to fix it:

1. Added an entry for the storage in /etc/hosts

2. Removed all bonds on the storage and tried with only one NIC

3. Removed bonding on the ESXi host

4. Tried it with and without a VMkernel gateway

5. Removed the DNS entries on the storage

I have to fix this problem as soon as possible. I hope someone can help.

Thanks

3 Replies
shechtl
Contributor

Does no one else have this problem? I can't believe it.

sqian
VMware Employee

With a NetBSD 5.0 NFS server, I couldn't reproduce the problem you ran into. I used the latest ESXi 3.5 Update 4 (VMware ESX Server 3i 3.5.0 build-153875). When ESXi 3.5 booted, the NFS server was not running. After the ESXi 3.5 host was ready, I powered on the NFS server, and the NAS datastore on the ESXi 3.5 host came back and worked OK.

I also tried classic ESX 3.5 (VMware ESX Server 3.5.0 build-153875). It worked as well, and I couldn't reproduce your problem. When the ESX host was ready while the NFS server was still powered off, vmkernel showed the error messages below, but after the NFS server booted up everything was fine: the NFS store came back and worked.

May 11 23:43:12 bjst-net06 vmkernel: 0:00:01:38.572 cpu3:1027)StorageMonitor: 196: vmhba0:0:0:0 status = 2/0 0x5 0x24 0x0

May 11 23:43:38 bjst-net06 vmkernel: 0:00:02:03.945 cpu3:1072)WARNING: NFS: 898: RPC error 13 (RPC was aborted due to timeout) trying to get port for Mount Program (100005) Version (3) Protocol (TCP) on Server (192.168.17.100)

May 11 23:44:38 bjst-net06 vmkernel: 0:00:03:04.065 cpu3:1072)WARNING: NFS: 898: RPC error 13 (RPC was aborted due to timeout) trying to get port for Mount Program (100005) Version (3) Protocol (TCP) on Server (192.168.17.100)

I have no experience with the open-e DSS product, but based on my experiments above, I suspect the problem is related to your open-e DSS configuration. Have you tried another NFS server, such as Red Hat, instead of open-e DSS?

shechtl
Contributor

I tried it with a FreeNAS server and everything worked, so it must be a problem with the DSS.
