Hi folks,
Wondering if anyone on the boards has experienced this particular issue. A quick run-through of our environment:
7 node cluster running VirtualCenter 2.5 Update 3
Hardware: 7 x DL585 G2s (BIOS up to date)
Storage: NetApp FAS3070c - NFS mounts used for storage
Each host running ESX 3.5 Update 3 (4 critical patches applied)
150 virtual machines running
5 vSwitches per host, each with 2 pNICs patched to 2 separate physical network switches (2 x Catalyst 6509)
vSwitch Configuration
Load Balancing: Route based on the Originating Virtual Port ID
Network Failover Detection: Link Status only
Notify Switches: Yes
Failback: Yes
1 vSwitch SC
1 vSwitch VMotion (private VLAN)
1 vSwitch VM Network
1 vSwitch NFS (separate VLAN)
1 vSwitch VM Network (redundant)
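(For what it's worth, my understanding of "Route based on the Originating Virtual Port ID" is that each virtual port is pinned to a single active uplink, so a given VM's traffic always leaves on the same pNIC until that link fails. A toy sketch of the idea, not VMware's actual code:)

```python
# Toy model of "Route based on Originating Virtual Port ID" teaming.
# Assumption: a simple modulo over the active uplinks; the real
# vSwitch logic is internal to ESX, this just illustrates the pinning.
def uplink_for_port(port_id, active_uplinks):
    return active_uplinks[port_id % len(active_uplinks)]

uplinks = ["vmnic0", "vmnic1"]
print(uplink_for_port(7, uplinks))      # vmnic1
# If vmnic1's link goes down, its ports move to what's left:
print(uplink_for_port(7, ["vmnic0"]))   # vmnic0
```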
Have run repeated physical cable checks to ensure the vmnics are patched properly; all check out and are running in their proper VLANs.
HA/DRS/VMotion all running fine.
Logged a ticket with VMware to verify our storage configuration was fine (included settings made from Netapp Best Practices guide for ESX) - confirmed running best practices.
VMs generally running fine (no disk errors reported)
Issue:
When performing a single network switch outage, around a quarter of the VMs lose access to their VMDKs (effectively, if you go on the console of an affected VM it will display a DOS-style PXE boot message). Failover of traffic from one switch to the other can take over 15 minutes on average.
Now, to replicate the issue, both the "active" NFS vSwitch vmnic and the 10 Gb fibre connection running from the physical switch to the NetApp filer need to be unplugged on the same physical switch; the issue does not occur if just one is unplugged.
Tried:
Setting Failback to No
Active/Standby for NFS vswitch vmnics
Made no difference.
Have tested in a lab environment using a single ESX host and 30 dummy VMs.
The VMs hang for roughly 2 minutes before coming back to life; during the hang period the event logs report a series of disk errors (symmpi).
Our networking group has confirmed PortFast is enabled on the ports and port security is disabled.
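We're also looking at raising the guest disk timeout, since the symmpi errors look like the Windows SCSI layer timing out during the hang; NetApp's ESX guidance (TR-3428, if I recall correctly) suggests 190 seconds so the guests ride out a storage path failover. A sketch of the registry change for Windows guests (0xbe = 190 decimal):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Disk]
"TimeoutValue"=dword:000000be
```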
Any suggestions gratefully received - forgive the rather lengthy submission!
Hi,
Do you have the latest Netapp Host utilities installed on your hosts?
Have you verified that the cluster failover on the filer is working correctly?
Which cfmode are you using?
What version of Data ONTAP (DOT) are you running?
Can you please post your VIF configurations?
Thanks
Paul
Hi,
In answer to your queries, I got the following responses back from our storage group:
Do you have the latest Netapp Host utilities installed on your hosts?
Not sure of the answer to this one; I'll check with my colleague in the storage team again.
Have you verified that the cluster failover on the filer is working correctly?
We have not had a filer failover yet. As we are active/active, and in a switch failure only one link in the team fails, the filer does NOT fail over. With LACP, all the traffic going down the previously active link should be retransmitted down the remaining active link.
Which cfmode are you using?
Negotiated failover enabled (network_interface).
What version of Data ONTAP (DOT) are you running?
7.2.4
Can you please post your VIF configurations?
SANsideA: 2 links, transmit 'IP Load balancing', VIF Type 'lacp' fail 'default'
VIF Status Up Addr_set
up:
e2a: state up, since 22Feb2009 20:09:10 (14+13:25:28)
mediatype: auto-10g_sr-fd-up
flags: enabled
active aggr, aggr port: e2b
input packets 5621928542, input bytes 4610125046907
input lacp packets 45337, output lacp packets 41936
output packets 6658048095, output bytes 7300071603521
up indications 2, broken indications 0
drops (if) 0, drops (link) 0
indication: up at 22Feb2009 20:09:10
consecutive 0, transitions 2
e2b: state up, since 22Feb2009 20:09:10 (14+13:25:28)
mediatype: auto-10g_sr-fd-up
flags: enabled
active aggr, aggr port: e2b
input packets 6394679077, input bytes 4682641181916
input lacp packets 45326, output lacp packets 41936
output packets 6694968077, output bytes 7436061192087
up indications 2, broken indications 0
drops (if) 0, drops (link) 0
indication: up at 22Feb2009 20:09:10
consecutive 0, transitions 2
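(On the LACP point above, my rough understanding of the 'IP Load balancing' transmit policy is that each source/destination IP pair hashes to one link, and when a link drops, the hash is simply taken over the remaining links, so flows move over rather than being lost. A toy illustration with a made-up hash, not ONTAP's actual algorithm:)

```python
# Toy model of per-flow IP load balancing over a VIF.
# Assumption: any deterministic hash of (src, dst) works for the
# illustration; ONTAP's real hash differs, but the failover
# behaviour is the point here.
def pick_link(src_ip, dst_ip, active_links):
    key = hash((src_ip, dst_ip))
    return active_links[key % len(active_links)]

links = ["e2a", "e2b"]
flow = ("172.22.185.65", "172.18.0.32")
assert pick_link(*flow, links) in links
# With e2a down, every flow lands on the surviving link:
assert pick_link(*flow, ["e2b"]) == "e2b"
```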
Many thanks!
John
Something to think about in this problem is that the NetApp is not true active/active. Yes, it will share NVRAM, but it will not share aggregates, so if you have VMs talking to Filer 0 and you drop the network path to Filer 0, the system will not be able to talk to the aggregate (LUN/NFS volume). The NetApp will change aggregate owners only if the second filer can see a problem.
If a request for a file comes to Filer 1, it will hand off the request to Filer 0. If Filer 0 can't talk to you, that's the end of that request; it will not send I/O from aggregates it owns via Filer 1.
I would still use failover even with LACP: you should have NICs bonded with LACP, and then a second set of NICs on a second switch, also bonded with LACP, set up as a failover pair with spanning tree blocking. That way, if a switch fails, you have a second link source. Also, I'm not sure how NFS handles dropped packets.
If something has changed on the NetApp, please let me know, but this is how I understand it to work on my NetApp 3070. Thanks.
@JMorz
Did you ever get an answer/resolution for this?
No, I am still working on this problem and how to overcome the shortfalls of NFS. Some days I wish I had just done Fibre Channel... the added cost would be worth the fewer problems at this point.
@[~699732]
Hello Michael,
I'm not sure it would help you, but I figure it can't hurt. Take a look at this post http://virtualgeek.typepad.com/virtual_geek/2009/06/a-multivendor-post-to-help-our-mutual-nfs-custom...
They actually discuss your issue in the comments.
There are also a few other good posts related to VMware ESX and NFS:
ESX Server, IP Storage, and Jumbo Frames
Setting VMware ESX vSwitch Load Balancing Policy via CLI
ESX Server, NIC Teaming, and VLAN Trunking
Good luck,
David
Incidentally, I'm not sure if you're still having issues with this, but I can state from personal experience that NFS can be VERY resilient for cluster and/or switch failover. Just last weekend I did a NetApp ONTAP upgrade, which involves cluster failover, on an 8-host ESX HA/DRS cluster connecting to a NetApp 3050 via NFS, with zero issues on the VMs.
The thread here is a little bit old so I won't go into anything deep technically...but just wanted to put out there that from my experience there's nothing technically/conceptually intractable about using NFS with VMware/NetApp.
Great news! I'm still in the process of building the ESX over NFS setup here. It's nice to know that everything on your side is going as planned.
If you don't mind, I'd like to ask a few questions.
a) Which version of ESX are you using? I'm working with ESX 3.5 update 4.
b) Have you changed anything in the NFS Advanced Settings?
c) Have you enabled Jumbo Frames?
d) Are you configuring your ESX machines via the CLI? If so, I'd like to share how I do things so we can compare ideas.
Thanks,
David
Sure...no problem....
#1 - 3.5 Update 4 as well.
#2 - yes, just the settings recommended by NetApp in TR-3428 (these are also put in place automatically by installing the free NetApp Host Utilities for ESX).
#3 - no, no jumbo frames....and surprisingly no real issues around that at all.
#4 - via GUI I'm afraid....happy to review any CLI commands you're using though.
Hello andriven,
Sorry for the *very* late reply!
#1. ESX version?
We have several ESX machines running 3.5 and 4.1. They're all using NFS for their DataStore. Everything is very solid. The NFS service is offered by a clustered Sun / Oracle ZFS Appliance 7410. These machines have two cluster heads and when we failover from one head to the other, we don't have any issues neither at the ESX side nor at the VM side. Which is great.
#2. Special NFS configurations?
We make a small change for NFS; see #4 below.
#3. Jumbo frames?
We haven't used Jumbo Frames either, and it's not a cause for concern. As an example, one of the ESX hosts has a mix of 17 Windows and Red Hat VMs over NFS and it's working great.
Actually, one of our network engineers pointed me to a very interesting article about jumbo frames (just ignore the marketing towards the end): Ethernet Jumbo Frames: The Good, the Bad and the Ugly [1] by Chelsio Communications.
NOTE: All NFS traffic is on a dedicated non-routable storage-only VLAN on a pair of dedicated Cisco switches stacked together. We use two aggregated pNIC ports, one on each of the switches.
#4. CLI configuration?
We do most of the ESX configuration via CLI as most of it is done by a KickStart. So all our ESX machines are configured exactly the same. This is the NFS related portion:
sudo vmware-vim-cmd hostsvc/advopt/update NFS.HeartbeatFrequency int 12
sudo vmware-vim-cmd hostsvc/advopt/update NFS.HeartbeatMaxFailures int 10
sudo vmware-vim-cmd hostsvc/advopt/update NFS.HeartbeatTimeout int 5
sudo vmware-vim-cmd hostsvc/advopt/update Net.TcpIpHeapSize int 30
sudo vmware-vim-cmd hostsvc/advopt/update Net.TcpIpHeapMax int 120
sudo vi /etc/hosts
<hosts>
# /etc/hosts
127.0.0.1 localhost.localdomain localhost
172.22.185.65 esxhost.domain.com esxhost
172.18.0.32 nas.vlan.storage
# EOF
</hosts>
vmkping nas.vlan.storage
sudo esxcfg-firewall -e nfsClient
sudo esxcfg-nas -a -o nas.vlan.storage -s /name/of/your/nfs/share NFS_DATASTORE_NAME
sudo esxcfg-nas -l
sudo vmware-vim-cmd hostsvc/advopt/update NFS.MaxVolumes int 32
sudo vmware-vim-cmd hostsvc/hostconfig | grep -C3 NFS.Max
sudo /etc/init.d/mgmt-vmware restart
sudo vmware-vim-cmd hostsvc/net/refresh
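As an aside on the heartbeat values above: my understanding (which may be off, so check TR-3428) is that ESX gives up on an NFS datastore after roughly HeartbeatFrequency x HeartbeatMaxFailures + HeartbeatTimeout seconds, which with these settings comes to about two minutes. That's also why the guest disk timeout needs to be high enough to ride out the window:

```python
# Rough worst-case window before ESX marks an NFS datastore down,
# using the advanced settings above (assumed formula, not official).
heartbeat_frequency = 12     # seconds between heartbeat attempts
heartbeat_max_failures = 10  # missed heartbeats tolerated
heartbeat_timeout = 5        # seconds to wait on each attempt

window = heartbeat_frequency * heartbeat_max_failures + heartbeat_timeout
print(window)  # 125 seconds, i.e. just over 2 minutes
```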
If you require more information, do not hesitate to ask.
HTH,
David