Hi,
I have an ESX 4.1 server with an NFS datastore (exported from a NAS filer). I have created a Red Hat 5.1 (32-bit) VM on the NFS datastore. The NFS export is redundant: it is aggregated across two ports on the NAS filer. To test failover, I run Iozone on this VM and, while I/O is in progress, pull one of the network cables from the NAS filer.
ESX detects the failure, and the NFS datastore becomes inactive and stays inactive for 10 minutes. After 10 minutes it becomes active on the other port, but by then all filesystems in my VM have become read-only, so Iozone starts reporting errors.
Can someone please help me out with this?
Regards
Prasad
Hi,
How is your NAS filer configured: failover or NIC teaming (and if teaming, does your switch support it)? How is the network on your ESX server configured?
Please provide more info.
Regards
Thanks for the prompt response.
I have created NIC teaming on the ESX server. My VM is attached to the "VM Network 2" portgroup.
vmnic2 and vmnic3 are configured Active-Active.
[root@esxserver1 ~]# esxcfg-vswitch -l
Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks
vSwitch0         128         3           128               1500    vmnic0

  PortGroup Name        VLAN ID  Used Ports  Uplinks
  VM Network            0        0           vmnic0
  Service Console       0        1           vmnic0

Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks
vSwitch1         128         12          128               1500    vmnic2,vmnic3

  PortGroup Name        VLAN ID  Used Ports  Uplinks
  VM Network 2          0        8           vmnic2,vmnic3
  NIC Team              0        1           vmnic2,vmnic3
On the NAS filer, I guess it is failover. It offers only two choices when creating a link aggregation:
LACP or Static ---> I selected Static
Normal or Round Robin ---> I selected Normal
Regards
Prasad
Could NFS locking be the problem? I am not sure how ESX handles a network failure on NFS, but I have had problems with my NFS stores going "inactive" because of locking. Might be something to look at.
http://docstore.mik.ua/orelly/networking_2ndEd/nfs/ch11_03.htm
OK, static is fine; LACP (dynamic) is not supported by ESX.
What switches do you use between the ESX server and the storage?
some info about link aggregation:
Dell switches: PowerConnect 5424.
I forgot to mention that vmnic2 and the 1st port of the NAS filer are connected to Switch1,
and
vmnic3 and the 2nd port of the NAS filer are connected to Switch2.
The switches are not cascaded.
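To make the cabling above easier to reason about, here is a minimal Python sketch that models it as a graph and checks which uplinks can still reach the NAS after a cable is pulled. The node names (`switch1`, `nas_port1`, and so on) and the `reachable` helper are illustrative assumptions, not anything ESX or the filer exposes, and the model only covers physical connectivity, not how ESX or the filer actually selects a path.

```python
from collections import deque

def reachable(links, start, targets):
    """Breadth-first search: True if any node in targets can be reached from start."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node in targets:
            return True
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Cabling as described in this thread (node names are hypothetical labels).
links = [
    ("vmnic2", "switch1"), ("switch1", "nas_port1"),
    ("vmnic3", "switch2"), ("switch2", "nas_port2"),
]
nas_ports = {"nas_port1", "nas_port2"}

# Simulate pulling the cable between Switch1 and the first NAS port.
after_pull = [link for link in links if link != ("switch1", "nas_port1")]

for nic in ("vmnic2", "vmnic3"):
    print(nic, "reaches NAS:", reachable(after_pull, nic, nas_ports))

# Hypothetical variant: the same failure with an inter-switch (cascade) link added.
cascaded = after_pull + [("switch1", "switch2")]
print("vmnic2 with cascade:", reachable(cascaded, "vmnic2", nas_ports))
```

In this model vmnic2 keeps its own link to Switch1 after the pull, yet it has no remaining path to the filer unless the switches are linked to each other.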
Thanks
Prasad
And what did you configure on your switches?
I'm not a pro in link aggregation, but I read a Cisco technical paper that says all links of an aggregate need to be connected to the same switch. If you are using multiple switches, you will need ones with SMLT (split multi-link trunking) support. I don't know if your Dell switches support that.
What I would do:
- Check whether both routes are really used when all cables are plugged in (the filer may report that everything is OK while using only one active route).
- Connect all NICs to one switch and test whether link aggregation works that way, which your switches do support.
Regards
I did not do any special configuration on my switches.
If I connect both links of the NAS filer to the same switch, it works fine for me. But I need these links on separate switches for redundancy.
Thanks
So if you connect both links to one switch and pull one cable, you don't get the connection loss and the read-only FS?
Then the switch is making a pretty good guess at what is happening, because normally you would have to tell it about the aggregation.
Is cascading the switches an option for you? That way SMLT support might not be needed. It may be worth a look at the Dell documentation.
Regards
Hi,
Yes, with both links on one switch there is no connection loss and no read-only FS.
No, cascading is not an option for me.
One additional piece of information: after I pull out one cable, the NFS datastore becomes inactive, but ping from ESX to the NAS filer still works fine.
Even the vmkernel IP is reachable. It is only ESX that does not handle this failover quickly.
FYI, in the vSwitch configuration, under the NIC Teaming tab, I have set "Network Failover Detection" to Beacon Probing.
Regards
Prasad
OK, you shouldn't use beacon probing here. Beacon probing needs at least 3 NICs. With only 2 NICs, the ESX server can't decide which NIC has failed, because neither of the two receives the other's beacon broadcasts. As a result, ESX keeps sending packets over the NIC that can't reach the NAS as well, and that might cause the 10-minute timeout you mentioned.
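A small Python sketch of that reasoning, in case it helps. It is a simplified model I made up for illustration (the `beacon_probe` function and its arguments are hypothetical, not an ESX API): every uplink sends beacons, a beacon arrives only if both sender and receiver have an intact path, and an uplink that hears nothing is a failure candidate.

```python
def beacon_probe(nics, isolated):
    """Return the uplinks that can be marked as failed, or None if ambiguous.

    nics:     list of uplink names in the team
    isolated: set of uplinks whose path to the other team members is broken
    """
    # For each uplink, collect the peers whose beacons it still receives.
    heard = {
        nic: {
            other for other in nics
            if other != nic and nic not in isolated and other not in isolated
        }
        for nic in nics
    }
    silent = {nic for nic in nics if not heard[nic]}
    # If no uplink hears any beacon, the host cannot tell which one is bad.
    if len(silent) == len(nics):
        return None
    return silent

# Two-NIC team: one broken path silences both sides, so the result is ambiguous.
print(beacon_probe(["vmnic2", "vmnic3"], {"vmnic3"}))

# Three-NIC team: the two healthy uplinks still hear each other.
print(beacon_probe(["vmnic1", "vmnic2", "vmnic3"], {"vmnic3"}))
```

With two uplinks the function returns None for a single failure, while with three it can single out the isolated uplink, which is the reason for the 3-NIC recommendation.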
Beacon Probing:
Regards
Thanks. This is really useful information. I have to use beacon probing, so I am thinking of adding one more NIC to the team.
Right now all 4 of my NICs are assigned as follows, so I don't have any spare NIC:
vmnic0 to vSwitch0 (COS)
vmnic2 & vmnic3 to vSwitch1 (vmkernel vSwitch for NAS storage)
vmnic1 to vSwitch2 (for VMs)
Please see the output below. (It is a little different from what I posted yesterday.)
*******************************************************************************************************
[root@esxserver1 ~]# esxcfg-vswitch -l
Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks
vSwitch0         128         3           128               1500    vmnic0

  PortGroup Name        VLAN ID  Used Ports  Uplinks
  VM Network            0        0           vmnic0
  Service Console       0        1           vmnic0

Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks
vSwitch1         128         4           128               1500    vmnic2,vmnic3  ---> (with Beacon)

  PortGroup Name        VLAN ID  Used Ports  Uplinks
  NIC Team              0        1           vmnic2,vmnic3

Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks
vSwitch2         128         2           128               1500    vmnic1

  PortGroup Name        VLAN ID  Used Ports  Uplinks
  VM Network 3          0        0           vmnic1
*************************************************************************************************************
I can remove vSwitch2 and its portgroup (VM Network 3) and then assign vmnic1 to vSwitch1, so vSwitch1 will have 3 vmnics. But I have a question: is it OK to have both portgroups under vSwitch1, i.e. "NIC Team" for vmkernel and "VM Network 3" for virtual machines?
Thank you again for your prompt replies.
regards
Prasad
Well, in a production environment I wouldn't mix NFS and VM NICs, but if it's just for testing, why not?
Thanks. So let's keep them separate, because I am doing VMware hardware certification.
Meanwhile I have opened a separate thread for the "read-only" problem of the GOS: http://communities.vmware.com/thread/303110
As long as I/O on my GOS works fine, I don't mind waiting 10 minutes for the failover to take over.
Thanks for all your help.