VMware Cloud Community
dballing
Contributor

Clustered Failover, NetApp, FibreChannel

I touched on this before with a similar question, but we've gotten more/better information to work with so I'd like to revisit it.

During our maintenance window this morning, we tested the cluster failover of our NetApp FAS3020 HA pair. Our Linux boxes all "worked like a champ" and didn't miss a beat.

Here's our mpath config:

# esxcfg-mpath -l

Disk vmhba1:0:0 /dev/sda (20480MB) has 4 paths and policy of Fixed

FC 6:1.0 2100001125925a18<->500a09819677b3f8 vmhba1:0:0 On active preferred

FC 6:1.0 2100001125925a18<->500a09839677b3f8 vmhba1:1:0 On

FC 6:1.1 2100001125925a19<->500a09819677b3f8 vmhba2:0:0 On

FC 6:1.1 2100001125925a19<->500a09839677b3f8 vmhba2:1:0 On

Disk vmhba1:0:5 /dev/sdb (1945600MB) has 4 paths and policy of Fixed

FC 6:1.0 2100001125925a18<->500a09819677b3f8 vmhba1:0:5 On active preferred

FC 6:1.0 2100001125925a18<->500a09839677b3f8 vmhba1:1:5 On

FC 6:1.1 2100001125925a19<->500a09819677b3f8 vmhba2:0:5 On

FC 6:1.1 2100001125925a19<->500a09839677b3f8 vmhba2:1:5 On

Disk vmhba1:0:6 /dev/sdc (1945600MB) has 4 paths and policy of Fixed

FC 6:1.0 2100001125925a18<->500a09819677b3f8 vmhba1:0:6 On active preferred

FC 6:1.0 2100001125925a18<->500a09839677b3f8 vmhba1:1:6 On

FC 6:1.1 2100001125925a19<->500a09819677b3f8 vmhba2:0:6 On

FC 6:1.1 2100001125925a19<->500a09839677b3f8 vmhba2:1:6 On

.... fairly straightforward. We have to talk to the primary head because of the way NetApp's clustering works (we can't "balance" it across the heads), as some previous threads have discussed.
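For reference, this is roughly how we pin everything to the primary head's path from the ESX 3.x service console. The option spelling below is from memory and the LUN ID is just an example, so treat it as a sketch and double-check against esxcfg-mpath's help on your build:

```
# List all LUNs, their paths, and the current multipathing policy
esxcfg-mpath -l

# Make sure a LUN uses the Fixed policy (example LUN vmhba1:0:5)
esxcfg-mpath --policy=fixed --lun=vmhba1:0:5

# Mark the path through the primary head as the preferred path
esxcfg-mpath --preferred --path=vmhba1:0:5 --lun=vmhba1:0:5
```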

So our methodology here was that we:

1.) Had a running ESX server talking to its primary path/head

2.) Issued a failover

3.) Noted that the ESX server continued to run (it could still see its own boot LUN)

4.) Started a basic VM which proceeded to boot up, seemingly fine.

5.) Issued a giveback (the filer-side commands are sketched below)
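For anyone reproducing this: the failover and giveback in steps 2 and 5 were driven from the filer console. A minimal sketch, assuming the standard Data ONTAP 7-mode cf commands and hypothetical head names:

```
# On the surviving head, confirm the cluster interconnect is healthy first
filer-a> cf status

# Take over the partner's resources (step 2 above)
filer-a> cf takeover

# ... run the ESX-side checks while the partner is down ...

# Return the resources to the partner (step 5 above)
filer-a> cf giveback
```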

It was at the giveback that we noticed the failure condition. We started seeing lots of errors like:

Jul 18 07:29:50 vmc120 vmkernel: 0:00:18:12.753 cpu3:1036)WARNING: SCSI: 5519: Failing I/O due to too many reservation conflicts

Jul 18 07:29:50 vmc120 vmkernel: 0:00:18:12.753 cpu3:1036)WARNING: SCSI: 7916: status SCSI reservation conflict, rstatus #c0de01 for vmhba1:0:5. residual R 919, CR 0, ER 3

Jul 18 07:29:50 vmc120 vmkernel: 0:00:18:12.753 cpu3:1036)WARNING: FS3: 4008: Reservation error: SCSI reservation conflict

.... so... what are we missing here? The SAN architecture itself appears to be fine; it's simply that the VMs and/or ESX itself are "not liking" the transition from one head to the other.

We've got LVM.EnableResignature set to "1", because I thought I'd read somewhere that this might be the issue, but it obviously did not solve the problem.
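For completeness, this is how we check and set that option from the service console. The advanced-option path is what I recall for ESX 3.x, so verify it with esxcfg-advcfg before relying on it:

```
# Show the current resignature setting (0 = off, 1 = on)
esxcfg-advcfg -g /LVM/EnableResignature

# Enable automatic resignaturing of VMFS volumes found on "new" LUNs
esxcfg-advcfg -s 1 /LVM/EnableResignature
```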

Any thoughts? Has anyone else got a configuration like the one we're describing that survives a head failover?

dballing
Contributor

http://now.netapp.com/NOW/knowledge/docs/hba/fcp_esx/fcpesxhu30/html/software/rnote/cautions.shtml

specifically....

"Linux guest operating systems not supported in a High Availability (HA) environment

Linux guest operating systems are currently not supported in a HA environment with NetApp storage systems."

WTF?

BINC-HCN
Enthusiast

Just wondering if NetApp or VMware ever resolved this. It seems we have run into a similar situation, and I was curious what your solution turned out to be.

dballing
Contributor

Here's what we learned along the way:

- Despite any documentation to the contrary from VMware or Network Appliance, the proper LUN type for your boot-from-SAN LUN is "Linux"; the "VMware" LUN type is exclusively for the VMFS storage LUNs.

- They don't actually support automatic failover. You have to do LUN resets to "wake up" the ESX server and get it to recognize the alternate path during that failure mode. With the boot LUN set to "Linux" you will at least have access to the service console, but your VMs will all freeze until you reset the LUN (and reset it again on the way back to the primary head). A sketch of the reset command is below.
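For anyone else stuck here, the reset we ended up issuing from the service console looks roughly like this. The vmkfstools lock option and the device path are from memory and examples only, so verify on your own build before using it in anger:

```
# Issue a LUN reset to clear the stale SCSI reservation on the VMFS LUN
# (vmhba1:0:5:0 is an example device; substitute your own adapter:target:LUN:partition)
vmkfstools -L lunreset /vmfs/devices/disks/vmhba1:0:5:0
```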

Amusingly, if you're only testing (say, during maintenance) with a single ESX host, the failover works just fine (it's a locking-related issue), so we were bouncing our lone host back and forth from head to head without trouble. The following week we were ready for our "live test"... and immediately had 80 virtuals gasping for air, and (worse) VMware took about an hour to sort out what had happened(!?!). Ugh.

So where we stand at this point is that we can "kinda sorta" survive a failover, and that appears to be all either VMware or NetApp will support, even though their compatibility guides make no mention of the limitation, and even though we pointed out to them that we spent about half a million dollars on hardware for what they told us was a supported platform, which turns out to have a significant, undocumented gap in its support.

So I guess I should mark my question as answered, but "not to my satisfaction". Hope this answers yours. :)
