VMware

9 Replies Last post: Sep 15, 2009 9:17 PM by andriven  

NFS Failover issue posted: Mar 5, 2009 7:19 AM

JMorz (Lurker, 2 posts since Dec 21, 2005)

Hi folks,

Wondering if anyone on the boards has experienced this particular issue, a quick run-through of our environment:

7 node cluster running VirtualCenter 2.5 Update 3

Hardware: 7 x DL585 G2 (BIOS up to date)

Storage: NetApp FAS3070c - NFS mounts used for storage

Each host running ESX 3.5 Update 3 (4 critical patches added-on)

150 virtual machines running

5 vSwitches per host, each with 2 pNICs patched to 2 separate physical network switches (2 x Catalyst 6509)

vSwitch Configuration

Load Balancing: Route based on the Originating Virtual Port ID

Network Failover Detection: Link Status only

Notify Switches: Yes

Failback: Yes

1 vSwitch SC

1 vSwitch VMotion (private VLAN)

1 vSwitch VM Network

1 vSwitch NFS (separate VLAN)

1 vSwitch VM Network (redundant)

Have run repeated physical cable checks to ensure the vmnics are patched properly; all check out and are running in their proper VLANs.

HA/DRS/VMotion all running fine.

Logged a ticket with VMware to verify our storage configuration was fine (it includes the settings from the NetApp best practices guide for ESX) - they confirmed we are running best practices.

VMs generally running fine (no disk errors reported).

Issue:

When performing a single network switch outage, around a quarter of the VMs lose access to their VMDKs (effectively, if you go to the console of an affected VM, it displays a DOS-style PXE boot message). Failover of traffic from one switch to the other can take over 15 minutes on average.

To replicate the issue, both the "active" NFS vSwitch vmnic and the 10Gb fibre connection running from the same physical switch to the NetApp filer need to be unplugged; the issue does not occur if just one is unplugged.
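
One way to put a number on the failover time described above is to poll the filer's NFS port from a machine on the NFS VLAN while the switch is taken down. The following is a minimal Python sketch, not something the original poster ran. The names `tcp_probe` and `record_outages` and the address in the usage comment are hypothetical, and it assumes NFS over TCP on the standard port 2049.

```python
import socket
import time
from typing import Callable, List, Tuple


def tcp_probe(host: str, port: int = 2049, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def record_outages(
    probe: Callable[[], bool],
    samples: int,
    interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[float, float]]:
    """Poll probe() `samples` times and return (start, duration) per outage."""
    outages: List[Tuple[float, float]] = []
    down_since = None
    for _ in range(samples):
        now = clock()
        up = probe()
        if not up and down_since is None:
            down_since = now  # first failed probe of a new outage
        elif up and down_since is not None:
            outages.append((down_since, now - down_since))
            down_since = None
        sleep(interval)
    if down_since is not None:
        # Outage was still in progress when polling stopped.
        outages.append((down_since, clock() - down_since))
    return outages


# Hypothetical usage (192.0.2.10 is a placeholder, not an address from the thread):
# outages = record_outages(lambda: tcp_probe("192.0.2.10"), samples=1800)
```

Note that this only measures network reachability from the probing machine. It says nothing about when ESX itself considers the datastore available again, so treat the result as a lower bound on what the VMs experience.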

Tried:

Setting Failback to No

Active/Standby for NFS vswitch vmnics

Made no difference.

Have tested in a lab environment using a single ESX host and 30 dummy VMs.

There the VMs hang for roughly 2 minutes before coming back to life; during the hang period a series of disk errors (symmpi) is reported in the event logs.
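
The disk errors during the hang are consistent with guest I/O timing out while the datastore is unreachable. As a rough illustration only, the sketch below compares an outage length with a guest disk timeout. The outage lengths come from this thread (about 2 minutes in the lab, 15 minutes or more in production); the timeout values and the function name `guest_io_outcome` are assumptions made up for the example, and real guest behaviour also depends on driver retries.

```python
def guest_io_outcome(outage_s: float, guest_disk_timeout_s: float) -> str:
    """Simplified model: does an outage outlast the guest's disk I/O timeout?"""
    if outage_s <= guest_disk_timeout_s:
        return "stalls"      # I/O pauses, then resumes without a guest-side timeout
    return "times out"       # guest disk driver gives up on the I/O and logs errors


# Outage lengths in seconds, taken from the thread.
# Timeout values in seconds are illustrative assumptions only.
for outage_s in (120, 900):
    for timeout_s in (60, 190):
        print(outage_s, timeout_s, guest_io_outcome(outage_s, timeout_s))
```

Under this model a longer guest disk timeout can hide a 2-minute stall, but nothing reasonable covers a 15-minute failover, so the failover time itself still has to be fixed.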

Our networking group have confirmed Portfast is enabled on the ports and port security is disabled.

Any suggestions gratefully received - forgive the rather lengthy submission!

Re: NFS Failover issue

1. Mar 6, 2009 1:42 PM in response to: JMorz
pgifford (Novice, 20 posts since Oct 25, 2007)

Hi,

Do you have the latest NetApp Host Utilities installed on your hosts?

Have you verified that the cluster failover on the filer is working correctly?

Which cfmode are you using?

What version of Data ONTAP (DOT) are you running?

Can you please post your VIF configurations?

Thanks

Paul

Re: NFS Failover issue

3. May 14, 2009 10:40 PM in response to: JMorz
michael12345 (Lurker, 2 posts since May 14, 2009)
Something to think about in this problem is that the NetApp is not true active/active. Yes, the two filers share NVRAM, but they do not share aggregates, so if you have VMs talking to Filer 0 and you drop the network path to Filer 0, the hosts will not be able to reach its aggregates (LUN/NFS volume). The NetApp will change aggregate ownership only if the second filer can see a problem.

If a request for a file comes to Filer 1, it will hand off the request to Filer 0. If Filer 0 can't talk to you, that's the end of that request; Filer 0 will not send I/O from aggregates it owns through Filer 1.

I would still use failover even with LACP: you should have NICs bonded with LACP, and then a second set of NICs on a second switch bonded with LACP, using failover and spanning tree blocking. This way, if a switch fails, you have a second link source. Also, I'm not sure how NFS handles dropped packets.

If something has changed on the NetApp, please let me know, but this is how I understand it to work on my NetApp 3070. Thanks.

Re: NFS Failover issue

4. May 15, 2009 5:36 AM in response to: JMorz
pgifford (Novice, 20 posts since Oct 25, 2007)

@JMorz

Did you ever get an answer/resolution for this?

Re: NFS Failover issue

5. May 15, 2009 8:24 AM in response to: pgifford
michael12345 (Lurker, 2 posts since May 14, 2009)
No, I am still working on this problem and on how to overcome the shortfalls of NFS. Some days I wish I had just done Fiber... the added cost would be worth having fewer problems at this point.

Re: NFS Failover issue

6. Jul 30, 2009 11:00 AM in response to: michael12345
drobilla (Lurker, 2 posts since Jul 30, 2009)

@michael12345

Hello Michael,

I'm not sure it will help you, but I figure it can't hurt. Take a look at this post: http://virtualgeek.typepad.com/virtual_geek/2009/06/a-multivendor-post-to-help-our-mutual-nfs-customers-using-vmware.html

They actually talk about your issue in the comments.

There are also a few other good posts related to VMware ESX and NFS:

ESX Server, IP Storage, and Jumbo Frames

Setting VMware ESX vSwitch Load Balancing Policy via CLI

ESX Server, NIC Teaming, and VLAN Trunking

Good luck,

David


Re: NFS Failover issue

7. Aug 28, 2009 10:12 PM in response to: drobilla
andriven (Novice, 23 posts since Jan 16, 2006)
Incidentally, I'm not sure if you're still having issues with this, but I can state from personal experience that NFS can be VERY resilient for cluster and/or switch failover. I just did a NetApp Data ONTAP upgrade last weekend (which involves cluster failover) on an 8-host ESX HA/DRS cluster connected to a NetApp 3050 via NFS, with zero issues for the VMs.

The thread here is a little old, so I won't go into anything technically deep, but I just wanted to put out there that, in my experience, there's nothing technically or conceptually intractable about using NFS with VMware/NetApp.

Re: NFS Failover issue

8. Aug 31, 2009 2:13 PM in response to: andriven
drobilla (Lurker, 2 posts since Jul 30, 2009)

Hi andriven,

Great news! I'm still in the process of building the ESX over NFS setup here. It's nice to know that everything on your side is going as planned.

If you don't mind, I'd like to ask a few questions.

a) Which version of ESX are you using? I'm working with ESX 3.5 update 4.

b) Have you changed anything in the NFS Advanced Settings?

c) Have you enabled Jumbo Frames?

d) Are you configuring your ESX machines via CLI? If so, then I'd like to share with you how I do things. This way we can compare/share ideas.

Thanks,

David

Re: NFS Failover issue

9. Sep 15, 2009 9:17 PM in response to: drobilla
andriven (Novice, 23 posts since Jan 16, 2006)

Sure...no problem....

#1 - 3.5 Update 4 as well.

#2 - yes, just the settings recommended by NetApp in TR-3428 (also put in automatically by installing the free NetApp Host Utilities for ESX as well).

#3 - no, no jumbo frames....and surprisingly no real issues around that at all.

#4 - via GUI I'm afraid....happy to review any CLI commands you're using though.
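
For anyone comparing NFS advanced settings as in #2: ESX exposes heartbeat parameters (NFS.HeartbeatFrequency, NFS.HeartbeatTimeout, NFS.HeartbeatMaxFailures) that influence how long a datastore can be unreachable before the host marks it unavailable. The sketch below works through a commonly cited approximation of that window. The formula and the numeric values are assumptions for illustration; they are not quoted from TR-3428 or from anyone's configuration in this thread, so check them against your own hosts and the vendor guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NfsHeartbeat:
    frequency_s: int   # NFS.HeartbeatFrequency: seconds between heartbeats
    timeout_s: int     # NFS.HeartbeatTimeout: seconds to wait for a reply
    max_failures: int  # NFS.HeartbeatMaxFailures: misses tolerated before giving up

    def unavailable_after_s(self) -> int:
        """Approximate seconds until the datastore is marked unavailable.

        Assumed model: max_failures heartbeats, frequency_s apart,
        plus one final reply timeout.
        """
        return self.frequency_s * self.max_failures + self.timeout_s


# Assumed example values, for illustration only.
assumed = NfsHeartbeat(frequency_s=12, timeout_s=5, max_failures=10)
print(assumed.unavailable_after_s())  # 12 * 10 + 5 = 125
```

With these assumed values the window comes to 125 seconds, which is in the same range as the roughly 2-minute hang reported from the lab test earlier in the thread. It does not account for the 15-minute failover seen in production.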
