ESX & NetApp: To vif, or not to vif?

mattjk · 07-02-2008

Firstly: I said NetApp in the subject line but I'd imagine this question would also apply to lots of other IP-based SANs.

We are in the process of deploying an entirely new virtualised server environment based around ESX 3.5 & a NetApp 2050A. All of our ESX storage will be IP-based (mainly NFS, with a bit of iSCSI).

For redundancy, we have 2 cross-connected pSwitches dedicated to IP storage, and each of the ESX servers has 2 teamed pNICs dedicated to storage traffic (1 pNIC connected to each switch).

My question is about how best to set up the NICs / IP addressing on the SAN to provide the best performance and redundancy when operating with ESX.

The 2050A has 2 GbE NICs per controller, each of which will be connected to a different pSwitch for redundancy. As far as I can see we have two options for how we configure these:

Option 1:

Set up a single-mode vif spanning the 2 NICs (in non-NetApp speak, this is basically an active/passive fail-over setup) with a single IP address covering the vif. Set up ESX IP storage to connect to the vif's IP address (only one path).

If the active SAN NIC or pSwitch fails the SAN will fail-over to the second SAN NIC, keeping the same IP address (nothing has changed from the ESX host's point of view).

Option 2:

Configure a separate IP address for each NIC (both NICs are active at once). Set up ESX IP storage to use the first SAN NIC's IP address as the primary path to storage, and the second SAN NIC's IP address as a secondary (fail-over) path.

If the primary-path SAN NIC dies, ESX will have to fail-over to the secondary-path SAN NIC.

As far as I can see, option #2 should offer better performance - I know ESX doesn't do iSCSI/NFS MPIO, but I assume we could alternate which SAN IP address is the primary path to the storage on each ESX host and therefore get some rudimentary load balancing across the SAN's NICs.
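To make the alternation idea concrete, here is a minimal Python sketch of how the primary path could be spread across hosts. The host names and IP addresses are made up for illustration, and the function only produces a plan on paper; it does not configure ESX or the filer.

```python
def assign_primary_paths(hosts, san_ips):
    """Alternate the primary SAN IP across hosts.

    Each host gets one primary IP (round-robin by position in the list);
    the remaining IPs are listed as its fail-over paths.
    """
    plan = {}
    for index, host in enumerate(hosts):
        primary = san_ips[index % len(san_ips)]
        secondary = [ip for ip in san_ips if ip != primary]
        plan[host] = {"primary": primary, "secondary": secondary}
    return plan


# Hypothetical host names and SAN NIC addresses, for illustration only.
hosts = ["esx01", "esx02", "esx03", "esx04"]
san_ips = ["192.168.50.11", "192.168.50.12"]

plan = assign_primary_paths(hosts, san_ips)
for host, paths in plan.items():
    print(host, "primary:", paths["primary"], "fail-over:", paths["secondary"])
```

With an even number of hosts this puts half on each SAN NIC, but it balances host counts only, not actual I/O load, so a busy host could still saturate one NIC.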

On the other hand, I think option #1 "seems" to be a more elegant/reliable solution and should(?) offer faster recovery times, as the SAN handles fail-over in the event of a SAN NIC or pSwitch failure - as far as ESX is concerned nothing has happened. Using this option (which is how we currently have things set up) the fail-over time is less than a second when we pull the active SAN NIC's network cable.
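For anyone wanting to compare fail-over times between the two options, a rough way to measure them is sketched below in Python. It assumes the filer accepts TCP connections on the standard NFS port (2049); the probe interval and timeout values are arbitrary choices. It only measures TCP reachability from a test machine, which is not the same thing as how quickly ESX itself recovers its datastores.

```python
import socket
import time


def probe(host, port=2049, timeout=0.2):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def collect(host, duration=30.0, interval=0.1):
    """Probe repeatedly for `duration` seconds; return (timestamp, ok) samples."""
    samples = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        samples.append((time.monotonic(), probe(host)))
        time.sleep(interval)
    return samples


def longest_outage(samples):
    """Longest gap (seconds) between the last success before a run of
    failures and the first success after it. Outages that never recover,
    or that begin before any success, are not counted."""
    longest = 0.0
    last_ok = None
    in_outage = False
    for timestamp, ok in samples:
        if ok:
            if in_outage and last_ok is not None:
                longest = max(longest, timestamp - last_ok)
            last_ok = timestamp
            in_outage = False
        else:
            in_outage = True
    return longest
```

The idea would be to start `collect()` against the SAN IP address (the vif address for option #1), pull the active cable during the run, and then pass the samples to `longest_outage()`. The resolution is limited by the probe interval plus the connection timeout.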

So, what do the experts think? Which of these options is better? What are others doing?

Thanks in advance!

Cheers, Matt