VMware Cloud Community
kclarksd83
Contributor

Trouble achieving true NFS redundancy (No single point of failure)

I'm sorry if this is a newbie question. We've been using iSCSI to connect to our NetApp SAN for a while now, but we've decided to switch to NFS, and I'm having trouble wrapping my head around how to configure our ESXi hosts to have true redundancy, without any single point of failure.

Right now each host has 4 gigabit Ethernet ports used for SAN data (2 onboard ports, 2 on a PCI-Express NIC), and those ports are split across 2 HP 2920 gigabit switches. Each switch has one 10G port: the top switch's 10G port connects to NetApp controller 1, and the bottom switch's 10G port connects to NetApp controller 2.

We're using port binding for the iSCSI VMKs, and it seems to work fine. I don't know if it's necessarily the best way, but we've tested it: if any cable, NIC, port, switch, or controller goes down, everything just continues to work.
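
In case it helps, this is roughly what that port binding looks like from the CLI (the adapter and vmk names here are placeholders, not our actual ones):

    # Bind each iSCSI VMkernel port to the software iSCSI adapter.
    # Check your actual names with "esxcli iscsi adapter list" and
    # "esxcli network ip interface list".
    esxcli iscsi networkportal add --adapter=vmhba33 --nic=vmk1
    esxcli iscsi networkportal add --adapter=vmhba33 --nic=vmk2
    esxcli iscsi networkportal list --adapter=vmhba33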

My problem is trying to get that same level of redundancy using NFS. From what I can see, NFS always uses one, and only one, connection to the SAN at a time.
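
In other words, when you mount an NFS datastore you give it exactly one server IP, so everything rides a single session to that address (at least that's my understanding of NFSv3 on ESXi). For illustration, with made-up share and datastore names:

    # An NFS datastore is mounted against a single server address.
    # The share path and datastore name below are made up.
    esxcli storage nfs add --host=10.1.0.14 --share=/vol/nfs_vol1 --volume-name=nfs_ds1
    esxcli storage nfs list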

So I've created a new VMkernel port on the same virtual switch as the iSCSI stuff, on a different subnet:
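
For what it's worth, this is roughly the CLI equivalent of what the screenshots below show (the portgroup name and IP are examples, not necessarily what we used):

    # Create the NFS VMkernel port on the existing vSwitch and give it
    # an address on the NFS subnet (portgroup name and IP are examples).
    esxcli network ip interface add --interface-name=vmk3 --portgroup-name="NFS"
    esxcli network ip interface ipv4 set --interface-name=vmk3 --ipv4=10.1.0.21 --netmask=255.255.255.0 --type=static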

[Screenshots: NFS VMkernel port and vSwitch/portgroup configuration]

And here's a look how everything is physically connected:

[Diagram: physical cabling between the host NICs, the two HP 2920 switches, and the two NetApp controllers]

This works fine to start with... If either vmnic4 or vmnic7 goes down, it will fail over to the other of the two. If the whole top switch goes down, it will fail over to vmnic5 or vmnic6, and since the switch is down, the NetApp controller will see that its link is down and fail the LIF over to controller 2. So traffic will switch to vmnic5 -> bottom switch -> controller 2, and everything will be happy.
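
I believe our NIC teaming on the NFS portgroup works out to something like this (again, roughly the CLI equivalent of the screenshot; the portgroup name is an example):

    # Top-switch uplinks active, bottom-switch uplinks standby.
    # The portgroup name is an example.
    esxcli network vswitch standard portgroup policy failover set \
        --portgroup-name="NFS" \
        --active-uplinks=vmnic4,vmnic7 \
        --standby-uplinks=vmnic5,vmnic6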

But the problem is if NetApp controller 1 goes down but the switch doesn't, or if just the 10G connection from controller 1 to the top switch goes down. If either of those happens, the NetApp NFS LIF (10.1.0.14) will automatically fail over to the 10G port on controller 2. But from my understanding, the ESXi host won't have any idea this happened. It will continue to try to send all its traffic out on vmnic4 or vmnic7, neither of which has any way to get down to controller 2. And since those ports themselves aren't physically down, the host will never fail over to vmnic5 or vmnic6.
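
You can watch this happen from the NetApp side; the LIF moves to the other node but keeps its IP, which is exactly why ESXi never notices (the SVM and LIF names below are placeholders):

    # ONTAP: show where the NFS LIF currently lives vs. where it belongs.
    # "svm_nfs" and "nfs_lif1" are placeholder names.
    network interface show -vserver svm_nfs -lif nfs_lif1 -fields home-node,curr-node,curr-port,failover-policy

    # Send the LIF back home once controller 1 / its 10G link recovers.
    network interface revert -vserver svm_nfs -lif nfs_lif1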

Another problem is if vmnic4 and vmnic7 happen to go down at the same time while the switches and controllers stay up: the same problem happens in reverse. The ESXi host will start trying to connect on one of the bottom NICs, but the NFS LIF is still up on the top switch, so it won't be reachable.
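
The quickest way I've found to check whether the currently active uplink can actually reach the LIF is vmkping, forcing the source interface to the NFS VMkernel port (vmk3 is a placeholder again):

    # Ping the NFS LIF from the NFS VMkernel port specifically.
    vmkping -I vmk3 10.1.0.14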

I'm totally new to NFS and relatively new to any of this compared to a lot of you on here, so I won't be offended to hear I'm doing everything wrong, or that I'm misunderstanding something... Can anyone help me come up with a way to get the same level of redundancy out of this NFS setup? (I'm not worried at all about bandwidth; we're going to be switching to a pair of 10G NICs per host in the near future, but the same problem will apply.)
