VMware Cloud Community
prasadmenon
Contributor

NFS Datastore on ESX 4.1

Hi,

I have an ESX 4.1 server with an NFS datastore (exported from a NAS filer). I have created a Red Hat 5.1 (32-bit) VM on the NFS datastore. The NFS export is redundant across two ports, which are aggregated on the NAS filer. To test failover, I run Iozone on this VM and, while I/O is in progress, pull one of the network cables from the NAS filer.

ESX detects the failure, and the NFS datastore becomes inactive and stays inactive for 10 minutes. After 10 minutes it becomes active on the other port, but by then all file systems in my VM have become read-only, so Iozone starts reporting errors.

Can someone please help me out with this?

Regards

Prasad

13 Replies
schepp
Leadership

Hi,

How is your NAS filer configured: failover or NIC teaming (if teaming, does your switch support it)? And how is your ESX server's network configured?

Please provide more info.

Regards

prasadmenon
Contributor

Thanks for the prompt response.

I have NIC teaming set up on the ESX server. My VM is attached to the "VM Network 2" port group.

vmnic2 and vmnic3 are configured active-active.

[root@esxserver1 ~]# esxcfg-vswitch -l
Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks
vSwitch0         128         3           128               1500    vmnic0

  PortGroup Name        VLAN ID  Used Ports  Uplinks
  VM Network            0        0           vmnic0
  Service Console       0        1           vmnic0

Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks
vSwitch1         128         12          128               1500    vmnic2,vmnic3

  PortGroup Name        VLAN ID  Used Ports  Uplinks
  VM Network 2          0        8           vmnic2,vmnic3
  NIC Team              0        1           vmnic2,vmnic3

On the NAS filer, I guess it is failover. It gives only 2 options when creating a link aggregation:

LACP or Static: I selected Static

Normal or Round Robin: I selected Normal

Regards

Prasad

damccumb
Contributor

Could NFS locking be the problem? I am not sure how ESX handles a network failure on NFS, but I have had problems with my NFS stores going "inactive" because of locking. It might be something to look at.

http://docstore.mik.ua/orelly/networking_2ndEd/nfs/ch11_03.htm

http://communities.vmware.com/thread/89271

schepp
Leadership

OK, static is fine; LACP (dynamic) is not supported by ESX.

What switches do you use between the ESX server and the storage?

Some info about link aggregation:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100404...

prasadmenon
Contributor

Dell switches: PowerConnect 5424.

I forgot to mention that vmnic2 and the first port of the NAS filer are connected to Switch1, and vmnic3 and the second port of the NAS filer are connected to Switch2.

Switches are not cascaded
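
Editor's sketch: the cabling described in this post can be modelled as a small graph to see what each ESX uplink can still reach after a cable is pulled at the NAS filer. This is only an illustration under simplifying assumptions: the node names (`switch1`, `nas_port1`, and so on) are made-up labels, and the model ignores link aggregation hashing, VLANs, and anything ESX-specific. It shows that after the pull, vmnic2 still has link to its switch even though it no longer has a path to the NAS filer.

```python
from collections import deque

# Cabling as described in the post (node names are illustrative labels).
LINKS = {
    ("vmnic2", "switch1"),
    ("switch1", "nas_port1"),
    ("vmnic3", "switch2"),
    ("switch2", "nas_port2"),
}

def reachable(links, start, goals):
    """Breadth-first search: can `start` reach any node in `goals`?"""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node in goals:
            return True
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False

def uplink_report(links):
    """For each ESX uplink: (local link is up, a path to the NAS exists)."""
    nas_ports = {"nas_port1", "nas_port2"}
    report = {}
    for nic in ("vmnic2", "vmnic3"):
        link_up = any(nic in link for link in links)
        report[nic] = (link_up, reachable(links, nic, nas_ports))
    return report

# Pull the cable between the NAS filer's first port and Switch1.
after_pull = LINKS - {("switch1", "nas_port1")}
for nic, (link_up, sees_nas) in uplink_report(after_pull).items():
    print(f"{nic}: link_up={link_up} reaches_nas={sees_nas}")
# vmnic2: link_up=True reaches_nas=False
# vmnic3: link_up=True reaches_nas=True
```

Adding a hypothetical `("switch1", "switch2")` entry to `LINKS` (a cascade between the switches) gives vmnic2 a path to the NAS filer's second port in this model.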

Thanks

Prasad

schepp
Leadership

And what did you configure on your switches?

I'm not a pro in link aggregation, but I read a technical Cisco paper that says all links need to be connected to the same switch. If you are using multiple switches, you will need ones with SMLT (Split Multi-Link Trunking) support. I don't know if your Dell switches support that.

What I would do:

- Check whether both routes are actually used when all cables are plugged in (it may report that everything is OK while only using one active route).

- Connect all NICs to one switch and test whether link aggregation works in that setup, which your switches do support.

Regards

prasadmenon
Contributor

I did not do any special configuration on my switches.

If I connect both links of the NAS filer to the same switch, it works fine for me. But I need to have these links on separate switches for redundancy.

Thanks

schepp
Leadership

So if you connect both links to one switch and pull one cable, you don't get the connection loss and the read-only FS?

Then the switch is making a pretty good guess at what is happening, because normally you would have to tell it about the aggregation.

Is cascading the switches an option for you? Maybe that way SMLT support wouldn't be needed. It may be worth taking a look at the Dell documentation.

Regards

prasadmenon
Contributor

Hi,

Yes, with both links on one switch there is no connection loss and no read-only FS.

No, cascading is not an option for me.

One additional piece of information: after I pull out one cable, the NFS datastore becomes inactive, but ping from ESX to the NAS filer still works fine.

Even the vmkernel IP is reachable. It is only ESX that does not handle this failover quickly.

In the vSwitch configuration, under the NIC Teaming tab, I have set "Network Failover Detection" to Beacon Probing, just FYI.

Regards

Prasad

schepp
Leadership

OK, you shouldn't use beacon probing. For beacon probing you should have at least 3 NICs. With 2 NICs the ESX server can't decide which NIC is unavailable, because neither of the two receives any broadcasts from the other. As a result, ESX will also send packets over the one NIC that can't reach the NAS, and that might cause the 10-minute timeout you mentioned.
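
Editor's sketch: the two-NIC ambiguity described above can be illustrated with a toy model. This is not how ESX implements beacon probing; it only assumes that every team NIC broadcasts a beacon, that a beacon arrives only if neither the sender's nor the receiver's path is broken, and that a NIC hearing no beacons at all is a "suspect". The function names are hypothetical.

```python
def beacons_heard(team, broken):
    """Return, per NIC, the set of teammates whose beacons it receives.

    Simplified model: a beacon from a peer arrives only if neither
    the peer's path nor the receiving NIC's path is broken.
    """
    return {
        nic: {peer for peer in team
              if peer != nic and nic not in broken and peer not in broken}
        for nic in team
    }

def suspects(team, broken):
    """NICs that hear no beacons at all; these cannot be told apart."""
    heard = beacons_heard(team, broken)
    return sorted(nic for nic, peers in heard.items() if not peers)

# Two uplinks, one broken path: both NICs go silent, so the failure is ambiguous.
print(suspects(["vmnic2", "vmnic3"], broken={"vmnic2"}))
# ['vmnic2', 'vmnic3']

# Three uplinks, one broken path: the healthy pair still hear each other.
print(suspects(["vmnic1", "vmnic2", "vmnic3"], broken={"vmnic2"}))
# ['vmnic2']
```

In this model a third uplink helps only because the two healthy NICs can still exchange beacons with each other, which in turn requires that their beacons can reach one another through the physical switches.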

Beacon Probing:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100557...

Regards

prasadmenon
Contributor

Thanks. This is really useful information. I have to use beacon probing, so I am thinking of adding one more NIC.

Right now I have assigned all 4 of my NICs as follows, so I don't have any NICs left:

vmnic0 to vSwitch0 (COS)

vmnic2 & vmnic3 to vSwitch1 (vmkernel vSwitch for NAS storage)

vmnic1 to vSwitch2 (for VMs)

Please see the output below. (This is a little different from what I posted yesterday.)

*******************************************************************************************************

[root@esxserver1 ~]# esxcfg-vswitch -l
Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks
vSwitch0         128         3           128               1500    vmnic0

  PortGroup Name        VLAN ID  Used Ports  Uplinks
  VM Network            0        0           vmnic0
  Service Console       0        1           vmnic0

Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks
vSwitch1         128         4           128               1500    vmnic2,vmnic3--->(With Beacon)

  PortGroup Name        VLAN ID  Used Ports  Uplinks
  NIC Team              0        1           vmnic2,vmnic3

Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks
vSwitch2         128         2           128               1500    vmnic1

  PortGroup Name        VLAN ID  Used Ports  Uplinks
  VM Network 3          0        0           vmnic1

*************************************************************************************************************

I can remove vSwitch2 and its port group (VM Network 3) and then assign vmnic1 to vSwitch1, so vSwitch1 will have 3 vmnics. But I have a question: is it OK to have both port groups under vSwitch1, i.e. "NIC Team" for the vmkernel and "VM Network 3" for virtual machines?
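
Editor's sketch: a small parser that counts uplinks per vSwitch from the `esxcfg-vswitch -l` text pasted in this post, to check each team against the three-NIC figure given earlier in the thread. It assumes the column layout shown above (name, three port counts, MTU, then an optional comma-separated uplink list). The function name and the `SAMPLE` constant are made up for illustration, and the listing does not reveal which failover detection mode is set, so this only counts uplinks.

```python
def uplinks_per_vswitch(listing):
    """Map each vSwitch name to its uplink list from `esxcfg-vswitch -l` text."""
    result = {}
    expect_switch_row = False
    for line in listing.splitlines():
        if line.startswith("Switch Name"):
            # The next non-empty line is the vSwitch row itself.
            expect_switch_row = True
        elif expect_switch_row and line.strip():
            fields = line.split()
            # Columns: name, num ports, used, configured, MTU, [uplinks]
            uplinks = fields[5].split(",") if len(fields) > 5 else []
            result[fields[0]] = uplinks
            expect_switch_row = False
    return result

# Listing from the post, with the "--->(With Beacon)" annotation removed.
SAMPLE = """\
Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks
vSwitch0         128         3           128               1500    vmnic0

  PortGroup Name        VLAN ID  Used Ports  Uplinks
  VM Network            0        0           vmnic0
  Service Console       0        1           vmnic0

Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks
vSwitch1         128         4           128               1500    vmnic2,vmnic3

  PortGroup Name        VLAN ID  Used Ports  Uplinks
  NIC Team              0        1           vmnic2,vmnic3

Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks
vSwitch2         128         2           128               1500    vmnic1

  PortGroup Name        VLAN ID  Used Ports  Uplinks
  VM Network 3          0        0           vmnic1
"""

MIN_UPLINKS_FOR_BEACON = 3  # figure given earlier in this thread

for name, uplinks in uplinks_per_vswitch(SAMPLE).items():
    enough = len(uplinks) >= MIN_UPLINKS_FOR_BEACON
    print(f"{name}: {len(uplinks)} uplink(s), enough for beacon probing: {enough}")
```

With the listing above, none of the three vSwitches reaches three uplinks; moving vmnic1 to vSwitch1 as proposed would bring that vSwitch to three.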

Thank you again for your prompt replies.

regards

Prasad

schepp
Leadership

Well, in a production environment I wouldn't mix NFS and VM NICs, but if it's just for testing, why not :)

prasadmenon
Contributor

Thanks. So let's keep them separate, because I am doing VMware hardware certification.

Meanwhile, I have opened a separate thread for the "read-only" problem in the guest OS (GOS): http://communities.vmware.com/thread/303110

As long as I/O in my GOS works fine, I don't mind waiting 10 minutes for the failover to take effect.

Thanks for all your help.
