dballing
Contributor

Clustered Failover, NetApp, FibreChannel

I touched on this before with a similar question, but we've gotten more/better information to work with so I'd like to revisit it.

During our maintenance window this morning, we tested the cluster-failover of our NetApp 3020HA solution. Our linux boxes all "worked like a champ" and didn't miss a beat.

Here's our mpath config:

# esxcfg-mpath -l

Disk vmhba1:0:0 /dev/sda (20480MB) has 4 paths and policy of Fixed

FC 6:1.0 2100001125925a18<->500a09819677b3f8 vmhba1:0:0 On active preferred

FC 6:1.0 2100001125925a18<->500a09839677b3f8 vmhba1:1:0 On

FC 6:1.1 2100001125925a19<->500a09819677b3f8 vmhba2:0:0 On

FC 6:1.1 2100001125925a19<->500a09839677b3f8 vmhba2:1:0 On

Disk vmhba1:0:5 /dev/sdb (1945600MB) has 4 paths and policy of Fixed

FC 6:1.0 2100001125925a18<->500a09819677b3f8 vmhba1:0:5 On active preferred

FC 6:1.0 2100001125925a18<->500a09839677b3f8 vmhba1:1:5 On

FC 6:1.1 2100001125925a19<->500a09819677b3f8 vmhba2:0:5 On

FC 6:1.1 2100001125925a19<->500a09839677b3f8 vmhba2:1:5 On

Disk vmhba1:0:6 /dev/sdc (1945600MB) has 4 paths and policy of Fixed

FC 6:1.0 2100001125925a18<->500a09819677b3f8 vmhba1:0:6 On active preferred

FC 6:1.0 2100001125925a18<->500a09839677b3f8 vmhba1:1:6 On

FC 6:1.1 2100001125925a19<->500a09819677b3f8 vmhba2:0:6 On

FC 6:1.1 2100001125925a19<->500a09839677b3f8 vmhba2:1:6 On

.... fairly straightforward. We have to talk to the primary head because of the way NetApp's clustering works (we can't "balance" it), as some previous threads have discussed.

So our methodology here was that we:

1.) Had a running ESX server talking to its primary path/head

2.) Issued a failover

3.) Noted that the ESX server continued to run (it could still see its own boot LUN)

4.) Started a basic VM which proceeded to boot up, seemingly fine.

5.) Issued a giveback

... it was at the giveback that we noticed the failure condition. We started seeing lots of errors like:

Jul 18 07:29:50 vmc120 vmkernel: 0:00:18:12.753 cpu3:1036)WARNING: SCSI: 5519: Failing I/O due to too many reservation conflicts

Jul 18 07:29:50 vmc120 vmkernel: 0:00:18:12.753 cpu3:1036)WARNING: SCSI: 7916: status SCSI reservation conflict, rstatus #c0de01 for vmhba1:0:5. residual R 919, CR 0, ER 3

Jul 18 07:29:50 vmc120 vmkernel: 0:00:18:12.753 cpu3:1036)WARNING: FS3: 4008: Reservation error: SCSI reservation conflict

.... so... what are we missing here? The SAN architecture itself appears to be fine; it's simply that the VMs themselves and/or ESX itself are "not liking" the transition from one head to another.
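For anyone reproducing this during a maintenance window, the failure state is easy to spot mechanically by counting reservation-conflict lines in the vmkernel log. A minimal sketch (the here-doc stands in for the live log; the sample lines mirror the errors quoted above):

```shell
# Count SCSI reservation conflicts in vmkernel log output.
# On a live ESX host you would run:
#   grep -c 'reservation conflict' /var/log/vmkernel
log=$(cat <<'EOF'
Jul 18 07:29:50 vmc120 vmkernel: WARNING: SCSI: 5519: Failing I/O due to too many reservation conflicts
Jul 18 07:29:50 vmc120 vmkernel: WARNING: FS3: 4008: Reservation error: SCSI reservation conflict
EOF
)
printf '%s\n' "$log" | grep -c 'reservation conflict'   # prints 2
```

A count that keeps climbing after the giveback is the signature of the problem described in this thread.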

We've got the LVM.EnableResignature set to "1", because I thought I'd read somewhere that this might have been the issue, but it did not, obviously, solve the problem.
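For reference, here is how that option can be checked and set from the ESX 3.x service console with the stock esxcfg-advcfg tool (a sketch; the related LVM.DisallowSnapshotLun option is shown only as something to inspect, not a recommendation):

```shell
# Show the current resignaturing-related settings (ESX 3.x service console).
esxcfg-advcfg -g /LVM/EnableResignature
esxcfg-advcfg -g /LVM/DisallowSnapshotLun

# Set LVM.EnableResignature to 1, as described above.
esxcfg-advcfg -s 1 /LVM/EnableResignature
```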

Any thoughts? Anyone else got a configuration like we're describing that is working through a head failover?

22 Replies
mikeddib
Enthusiast

We have a similar configuration with both a NetApp 3020 and a 3050 cluster. We had an issue two weeks ago where we needed to fail over and then perform a giveback, and had the same results you mentioned. We haven't yet tested further to determine a possible solution, but we will, and when we do we'll be sure to post back.

Can you post how you recovered from the issue? I found the following post describing our scenario, and we believe the last step got us back online. If you could confirm you did the same, or took a different route, I would appreciate it.

http://www.vmware.com/community/thread.jspa?messageID=643655&#643655

dballing
Contributor

We recovered simply by giving back and rebooting the ESX blade.

Talking with VMware support yesterday, we confirmed that what is -- essentially -- happening is that:

1.) The ESX kernel is handling the MPIO transition just fine (it's a boot-from-SAN blade, the logging continues to work after the transition, etc., etc.)

2.) But the running VMs aren't told about the transition.... so they still attempt to talk directly via a FC path which is no longer valid.

A *new* VM would pick up and talk to the "currently active" path (which is what happened in our test, because we booted the VM in the failover state), but when you do a giveback, that path would then cease to be valid.

It's simply that the VMs aren't handling the transition. Why is a whole different story....

Cheers,

D

stvkpln
Virtuoso

That seems a bit suspicious, but it may be due to the mode you're running your clusters in. Do you know offhand? We run all of our 6070Cs in single_image, which lets us effectively run things in tandem: you see multiple paths to the LUN, presented from both Filers, and you can see the distinct paths. The big benefit here is that you don't need to specify those goofy standby adapters.

As an example, here's the output from one of my test boxes:

Disk vmhba1:0:1 /dev/sda (512078MB) has 4 paths and policy of Fixed

FC 6:14.0 210000e08b8508fc<->500a0984877938a0 vmhba1:0:1 On active preferred

FC 6:14.0 210000e08b8508fc<->500a098977938a0 vmhba1:1:1 On

FC 6:14.1 210100e08ba508fc<->500a0981877938a0 vmhba2:0:1 On

FC 6:14.1 210100e08ba508fc<->500a0983977938a0 vmhba2:1:1 On

Disk vmhba1:0:2 /dev/sdb (512078MB) has 4 paths and policy of Fixed

FC 6:14.0 210000e08b8508fc<->500a0984877938a0 vmhba1:0:2 On

FC 6:14.0 210000e08b8508fc<->500a0982977938a0 vmhba1:1:2 On active preferred

FC 6:14.1 210100e08ba508fc<->500a0981877938a0 vmhba2:0:2 On

FC 6:14.1 210100e08ba508fc<->500a0983977938a0 vmhba2:1:2 On

Disk vmhba1:0:3 /dev/sdc (512078MB) has 4 paths and policy of Fixed

FC 6:14.0 210000e08b8508fc<->500a0984877938a0 vmhba1:0:3 On

FC 6:14.0 210000e08b8508fc<->500a0982977938a0 vmhba1:1:3 On

FC 6:14.1 210100e08ba508fc<->500a0981877938a0 vmhba2:0:3 On active preferred

FC 6:14.1 210100e08ba508fc<->500a0983977938a0 vmhba2:1:3 On

Disk vmhba1:0:4 /dev/sdd (512078MB) has 4 paths and policy of Fixed

FC 6:14.0 210000e08b8508fc<->500a0984877938a0 vmhba1:0:4 On

FC 6:14.0 210000e08b8508fc<->500a0982977938a0 vmhba1:1:4 On

FC 6:14.1 210100e08ba508fc<->500a0981877938a0 vmhba2:0:4 On

FC 6:14.1 210100e08ba508fc<->500a0983977938a0 vmhba2:1:4 On active preferred

Disk vmhba2:0:20 /dev/sde (512078MB) has 4 paths and policy of Fixed

FC 6:14.1 210100e08ba508fc<->500a0981877938a0 vmhba2:0:20 On active preferred

FC 6:14.1 210100e08ba508fc<->500a0983977938a0 vmhba2:1:20 On

FC 6:14.0 210000e08b8508fc<->500a0984877938a0 vmhba1:0:20 On

FC 6:14.0 210000e08b8508fc<->500a0982977938a0 vmhba1:1:20 On

Notice how I've got two distinct Filers displayed here, whereas your multipath output shows everything as a single unit. I think switching your cluster to single_image mode would go a long way toward resolving this, as the path your VMs were looking for would no longer "disappear". You may also want to engage NetApp support as well... I'm sure they'd tell you a similar thing.

Hope that helps!

-Steve
mikeddib
Enthusiast

I am actually at a VMUG event in New England and had a chance to bounce this off of a NetApp engineer here. With the details I gave him he pointed to a few things quickly.

1) Are you running the latest version of ONTAP? I told him we were on 7.1.1, and he apologized for giving the stock 'run the latest version' answer, but said it was something we should at least strongly consider.

2) He mentioned the NetApp Host Attach Kit for ESX 3.0, which I am fairly certain we have not implemented. He explained that these tools set certain parameters, such as queue length, that could help as well.

3) Lastly, he said we should be using SII (single image mode) and not partner mode.

As I did a quick search on some of this I came across the following link. Seems like NetApp is making the rounds at these events and the pitch is consistent.

http://frenchfamily.org/hunter/?p=175

dballing
Contributor

I actually don't see a lot of difference between your output and my own... what am I supposed to be looking for that I'm not seeing? :)

Also, how do I tell if it's in single_image mode or not?

dballing
Contributor

Oh, I forgot to mention. We're running 7.2.2.

stvkpln
Virtuoso

The key differentiator between our respective outputs isn't terribly intuitive, so let me use bold :)

Mine:

Disk vmhba1:0:1 /dev/sda (512078MB) has 4 paths and policy of Fixed

FC 6:14.0 210000e08b8508fc<->500a0984**87**7938a0 vmhba1:0:1 On active preferred

FC 6:14.0 210000e08b8508fc<->500a0982**97**7938a0 vmhba1:1:1 On

FC 6:14.1 210100e08ba508fc<->500a0981**87**7938a0 vmhba2:0:1 On

FC 6:14.1 210100e08ba508fc<->500a0983**97**7938a0 vmhba2:1:1 On

Yours:

Disk vmhba1:0:0 /dev/sda (20480MB) has 4 paths and policy of Fixed

FC 6:1.0 2100001125925a18<->500a0981**96**77b3f8 vmhba1:0:0 On active preferred

FC 6:1.0 2100001125925a18<->500a0983**96**77b3f8 vmhba1:1:0 On

FC 6:1.1 2100001125925a19<->500a0981**96**77b3f8 vmhba2:0:0 On

FC 6:1.1 2100001125925a19<->500a0983**96**77b3f8 vmhba2:1:0 On

Notice that with mine, two paths go to the unit with 87 in the WWPN, while the other two go to the unit with 97. This is what using single_image mode does: it presents the clustered pair under a single node name, with unique WWPNs. Means you don't need to worry about defining partner paths, etc. Much nicer.
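One way to make that comparison mechanical: count the distinct target (filer-side) WWPNs across the paths of a LUN in the esxcfg-mpath output. A sketch using plain awk/sort (the here-doc stands in for `esxcfg-mpath -l`; the sample lines are the vmhba1:0:0 paths from the first post):

```shell
# Count distinct target WWPNs across the 4 paths of one LUN.
paths=$(cat <<'EOF'
FC 6:1.0 2100001125925a18<->500a09819677b3f8 vmhba1:0:0 On active preferred
FC 6:1.0 2100001125925a18<->500a09839677b3f8 vmhba1:1:0 On
FC 6:1.1 2100001125925a19<->500a09819677b3f8 vmhba2:0:0 On
FC 6:1.1 2100001125925a19<->500a09839677b3f8 vmhba2:1:0 On
EOF
)
printf '%s\n' "$paths" \
  | awk '{split($3, a, "<->"); print a[2]}' \
  | sort -u | wc -l
# 4 paths but only 2 distinct target WWPNs in this sample; a pair
# presenting one port per adapter per head would show 4 distinct WWPNs.
```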

To check what mode you're running in, the command is: fcp show cfmode

-Steve
dballing
Contributor

Answering my own question:

filer1> fcp show cfmode

fcp show cfmode: single_image

... so I'm already using SII.

stvkpln
Virtuoso

Can you paste the output of fcp show adapters from the Filer that owns the WWPN 500a09819677b3f8?

-Steve
dballing
Contributor

Your wish is my command. :)

filer1> fcp show adapters

Slot: 0c

Description: Fibre Channel Target Adapter 0c (Dual-channel, QLogic 2322 (2362) rev. 3)

Adapter Type: Local

Status: ONLINE

FC Nodename: 50:0a:09:80:86:77:b3:f8 (500a09808677b3f8)

FC Portname: 50:0a:09:81:96:77:b3:f8 (500a09819677b3f8)

Standby: No

Slot: 0a

Description: Fibre Channel Target Adapter 0a (Dual-channel, QLogic 2322 (2362) rev. 3)

Adapter Type: Local

Status: ONLINE

FC Nodename: 50:0a:09:80:86:77:b3:f8 (500a09808677b3f8)

FC Portname: 50:0a:09:83:96:77:b3:f8 (500a09839677b3f8)

Standby: No

stvkpln
Virtuoso

errrrrrr I should have had you run that from both Filers... Also, as a point of reference, I'm going to assume that your ESX hosts are zoned to both Filers at the FC switch?

-Steve
dballing
Contributor

I thought you'd ask that... I should've just run with my gut feeling. :)

Also, yes, they are zoned properly (at least I believe so... ESX itself continues to run and write to the drives post-failover, so clearly -- at the hardware layer at least -- everything is functioning properly).

filer2> fcp show adapters

Slot: 0c

Description: Fibre Channel Target Adapter 0c (Dual-channel, QLogic 2322 (2362) rev. 3)

Adapter Type: Local

Status: ONLINE

FC Nodename: 50:0a:09:80:86:77:b3:f8 (500a09808677b3f8)

FC Portname: 50:0a:09:81:96:77:b3:f8 (500a09819677b3f8)

Standby: No

Slot: 0a

Description: Fibre Channel Target Adapter 0a (Dual-channel, QLogic 2322 (2362) rev. 3)

Adapter Type: Local

Status: ONLINE

FC Nodename: 50:0a:09:80:86:77:b3:f8 (500a09808677b3f8)

FC Portname: 50:0a:09:83:96:77:b3:f8 (500a09839677b3f8)

Standby: No

dballing
Contributor

FWIW, the IDs on my two heads both appear to be the same, which seems to jibe with being in single-image mode.... What you're showing, with each port seemingly having its own ID number (four distinct WWPNs on the NetApp side), looks like it's NOT in single-image mode, at least as I understand it...

stvkpln
Virtuoso

See, to me... that just doesn't seem right. No two HBAs should have the same WWPN, unless something is doing some form of masquerading. I think that's the crux of your problem: the lack of distinct paths to the Filers. Perhaps it's different with the 3000 series vs. the 6000s, which is what we run for the SAN arrays... but I'd definitely look at that configuration and get NetApp involved to verify that that is, in fact, how it should be set up. I'd be curious to know why.

We are, most definitely, running single_image mode. It's the default in 7.2, and it's by far the preferred mode when using 4Gbit HBAs. Delving deeper, what single_image does, as I understand it, is allow both Filer heads to advertise the same node name, so that the pair looks like a single unit for failover purposes; otherwise you need partner or standby interfaces allocated that effectively masquerade the WWPN. What you end up seeing is that the Filers are aware of one another's configurations: if you try to configure an igroup with a member that already has a LUN ID mapped to it from the other head, it will *not* let you do it, as you'd expect, since you can't present the same LUN ID to the same node from the same WWN and expect it not to get confused.

Here's the output of the relevant interfaces on my two Filers. I'll bold the relevant info regarding what makes single_image... single_image, and italicize the differentiators!

Filer1:

Slot: 5a

Description: Fibre Channel Target Adapter 5a (Dual-channel, QLogic 2432 (2462) rev. 2)

Adapter Type: Local

Status: ONLINE

FC Nodename: **50:0a:09:80:87:79:38:a0 (500a0980877938a0)**

FC Portname: *50:0a:09:81:87:79:38:a0 (500a0981877938a0)*

Standby: No

Slot: 6a

Description: Fibre Channel Target Adapter 6a (Dual-channel, QLogic 2432 (2462) rev. 2)

Adapter Type: Local

Status: ONLINE

FC Nodename: **50:0a:09:80:87:79:38:a0 (500a0980877938a0)**

FC Portname: *50:0a:09:83:87:79:38:a0 (500a0983877938a0)*

Standby: No

Filer2:

Slot: 5b

Description: Fibre Channel Target Adapter 5b (Dual-channel, QLogic 2432 (2462) rev. 2)

Adapter Type: Local

Status: ONLINE

FC Nodename: **50:0a:09:80:87:79:38:a0 (500a0980877938a0)**

FC Portname: *50:0a:09:82:97:79:38:a0 (500a0982977938a0)*

Standby: No

Slot: 6b

Description: Fibre Channel Target Adapter 6b (Dual-channel, QLogic 2432 (2462) rev. 2)

Adapter Type: Local

Status: ONLINE

FC Nodename: **50:0a:09:80:87:79:38:a0 (500a0980877938a0)**

FC Portname: *50:0a:09:84:97:79:38:a0 (500a0984977938a0)*

Standby: No

Keep in mind that NetApp storage arrays do create a bit of a tangle when maintaining storage on the ESX side, but the FCP Attach Kit for ESX does wonders for making it 10x easier to work with. Make sure to apply patch 1000039 (http://www.vmware.com/support/vi3/doc/esx-1000039-patch.html), as it has a fix for sanlun, an app which essentially looks at the LUN info coming from the Filer and determines which path is a proxy vs. a non-proxy path.

-Steve
dballing
Contributor

We did open a NetApp case on it (2368014), but haven't heard back yet.

See, the confusing part (well, I say "the" .. I mean "one of many") is that a lot of the things you're describing as potential fixes (ESX patches, FCP host attach kit for ESX, etc.) are all ESX-side .... none of those would affect the "symptoms" that I see on the NetApp side (e.g., the identical port numbering between the heads).

And it confuses me even more that... it WORKS at the operating system level... the Linux boxes on the SAN sort it all out, the ESX kernel itself sorts it out... it's only the VMs that choke. That's what makes this so damned confusing.

What's this host attach kit you refer to, though? I'm not sure I've ever seen any info on that.... maybe we'll do that, and that patch in the upcoming Wed. morning maintenance window and see if that solves it...

But again, how would either of those change the fact that the heads themselves report identical port names?

Ugh, brain hurts.

stvkpln
Virtuoso

As I said, you need to get with NetApp and verify that your Filers are actually configured properly, first and foremost. Applying any ESX-side patches, software, etc will not resolve your problems; you need to figure out why both Filers have the same WWPN on the adapters... That's not so good.

-Steve
wtreutz
Enthusiast

Hi,

I have seen similar issues/messages in a setup of 2x FAS3050 (ONTAP 7.1.1.1) acting as a MetroCluster, connected to two FC fabrics (each built from 2x Brocade 200E switches), which in turn connect to 6x VMware ESX Server 3.0.1 hosts, each with two FC controllers (each controller attached to one fabric). Daily operation was OK and a planned TAKEOVER worked OK, but a planned GIVEBACK produced the following error messages in /var/log/vmkernel on all ESX servers.

Jan 3 05:00:54 esx01 vmkernel: 0:00:44:18.340 cpu1:1036)WARNING: SCSI: 5519: Failing I/O due to too many reservation conflicts

Jan 3 05:00:54 esx01 vmkernel: 0:00:44:18.340 cpu1:1036)WARNING: SCSI: 7916: status SCSI reservation conflict, rstatus #c0de01 for vmhba1:0:0. residual R 919, CR 0, ER 3

Jan 3 05:00:54 esx01 vmkernel: 0:00:44:18.340 cpu1:1036)WARNING: FS3: 4008: Reservation error: SCSI reservation conflict

Jan 3 05:00:54 esx01 vmkernel: 0:00:44:18.666 cpu2:1034)SCSI: vm 1034: 5509: Sync CR at 0

Jan 3 05:00:54 esx01 vmkernel: 0:00:44:18.666 cpu2:1034)WARNING: SCSI: 5519: Failing I/O due to too many reservation conflicts

Jan 3 05:00:54 esx01 vmkernel: 0:00:44:18.666 cpu2:1034)WARNING: SCSI: 5615: status SCSI reservation conflict, rstatus 0xc0de01 for vmhba1:0:0. residual R 919, CR 0, ER 3

Jan 3 05:00:54 esx01 vmkernel: 0:00:44:18.666 cpu2:1034)FSS: 343: Failed with status 0xbad0022 for f530 28 2 454a17cd 6055235c 15001471 a2dd1217 4 1 0 0 0 0 0

Jan 3 05:00:55 esx01 vmkernel: 0:00:44:19.233 cpu3:1036)SCSI: vm 1036: 5509: Sync CR at 64

Jan 3 05:00:55 esx01 vmkernel: 0:00:44:19.567 cpu2:1034)SCSI: vm 1034: 5509: Sync CR at 64

Jan 3 05:00:56 esx01 vmkernel: 0:00:44:20.241 cpu2:1036)SCSI: vm 1036: 5509: Sync CR at 48

Jan 3 05:00:56 esx01 vmkernel: 0:00:44:20.487 cpu2:1034)SCSI: vm 1034: 5509: Sync CR at 48

Jan 3 05:00:57 esx01 vmkernel: 0:00:44:21.203 cpu2:1036)SCSI: vm 1036: 5509: Sync CR at 32

Jan 3 05:00:57 esx01 vmkernel: 0:00:44:21.394 cpu2:1034)SCSI: vm 1034: 5509: Sync CR at 32

Jan 3 05:00:58 esx01 vmkernel: 0:00:44:22.165 cpu2:1036)SCSI: vm 1036: 5509: Sync CR at 16

Jan 3 05:00:58 esx01 vmkernel: 0:00:44:22.417 cpu2:1034)SCSI: vm 1034: 5509: Sync CR at 16

Jan 3 05:00:58 esx01 vmkernel: 0:00:44:23.107 cpu2:1036)SCSI: vm 1036: 5509: Sync CR at 0

Jan 3 05:00:58 esx01 vmkernel: 0:00:44:23.107 cpu2:1036)WARNING: SCSI: 5519: Failing I/O due to too many reservation conflicts

Jan 3 05:00:58 esx01 vmkernel: 0:00:44:23.107 cpu2:1036)WARNING: SCSI: 7916: status SCSI reservation conflict, rstatus #c0de01 for vmhba1:0:0. residual R 919, CR 0, ER 3

Jan 3 05:00:58 esx01 vmkernel: 0:00:44:23.107 cpu2:1036)WARNING: FS3: 4008: Reservation error: SCSI reservation conflict

Jan 3 05:00:59 esx01 vmkernel: 0:00:44:23.393 cpu2:1034)SCSI: vm 1034: 5509: Sync CR at 0

Jan 3 05:00:59 esx01 vmkernel: 0:00:44:23.393 cpu2:1034)WARNING: SCSI: 5519: Failing I/O due to too many reservation conflicts

Jan 3 05:00:59 esx01 vmkernel: 0:00:44:23.393 cpu2:1034)WARNING: SCSI: 5615: status SCSI reservation conflict, rstatus 0xc0de01 for vmhba1:0:0. residual R 919, CR 0, ER 3

Jan 3 05:00:59 esx01 vmkernel: 0:00:44:23.393 cpu2:1034)FSS: 343: Failed with status 0xbad0022 for f530 28 2 454a17cd 6055235c 15001471 a2dd1217 4 1 0 0 0 0 0

Jan 3 05:01:00 esx01 vmkernel: 0:00:44:24.244 cpu2:1036)SCSI: vm 1036: 5509: Sync CR at 64

Jan 3 05:01:00 esx01 vmkernel: 0:00:44:24.279 cpu3:1033)SCSI: vm 1033: 5509: Sync CR at 64

Jan 3 05:01:01 esx01 vmkernel: 0:00:44:25.170 cpu2:1036)SCSI: vm 1036: 5509: Sync CR at 48

After some tests, and with experience and knowledge gained by reading and experimenting, we came to the following points.

After a planned TAKEOVER we see a correct failover on all ESX servers.

[root@esx01 vmware]# esxcfg-mpath -l

Disk vmhba0:0:0 /dev/sda (69472MB) has 1 paths and policy of Fixed

Local 2:14.0 vmhba0:0:0 On active preferred

Disk vmhba1:0:0 /dev/sdb (204816MB) has 4 paths and policy of Fixed

FC 12:4.0 210000e08b91bad8<->500a09819627ff01 vmhba1:0:0 On active preferred

FC 12:4.0 210000e08b91bad8<->500a09818627ff01 vmhba1:1:0 Dead

FC 12:6.0 210000e08b91abd5<->500a09829627ff01 vmhba2:0:0 On

FC 12:6.0 210000e08b91abd5<->500a09828627ff01 vmhba2:1:0 Dead

The lost/dead paths are expected here.

After we initiated the GIVEBACK command, we see the following on all ESX servers:

[root@esx01 vmware]# esxcfg-mpath -l

Disk vmhba0:0:0 /dev/sda (69472MB) has 1 paths and policy of Fixed

Local 2:14.0 vmhba0:0:0 On active preferred

Disk vmhba1:0:0 /dev/sdb (204816MB) has 4 paths and policy of Fixed

FC 12:4.0 210000e08b91bad8<->500a09819627ff01 vmhba1:0:0 On active preferred

FC 12:4.0 210000e08b91bad8<->500a09818627ff01 vmhba1:1:0 On

FC 12:6.0 210000e08b91abd5<->500a09829627ff01 vmhba2:0:0 On

FC 12:6.0 210000e08b91abd5<->500a09828627ff01 vmhba2:1:0 On

So the GIVEBACK works correctly at the LUN level, but some of the VMs freeze. These are the specific VMs placed on the LUN that shows the SCSI RESERVATION errors in /var/log/vmkernel.

With the command >>vmkfstools -L targetreset /vmfs/devices/disks/vmhba1:0:0:0<< (or whichever LUN is showing the SCSI errors) we could temporarily clear the errors, and everything worked OK until the next GIVEBACK.
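When several datastore LUNs hit the reservation errors at once, the targetreset commands can be generated straight from the esxcfg-mpath listing. A sketch that only echoes the commands for review (the here-doc stands in for `esxcfg-mpath -l`, and appending `:0` for the partition follows the vmhba1:0:0:0 naming used above):

```shell
# Build a vmkfstools targetreset command for every FC disk that
# esxcfg-mpath reports, without executing anything.
mpath=$(cat <<'EOF'
Disk vmhba1:0:0 /dev/sdb (204816MB) has 4 paths and policy of Fixed
Disk vmhba1:0:5 /dev/sdc (1945600MB) has 4 paths and policy of Fixed
EOF
)
printf '%s\n' "$mpath" \
  | awk '/^Disk vmhba/ {print "vmkfstools -L targetreset /vmfs/devices/disks/" $2 ":0"}'
```

Review the echoed commands before piping them into a shell on a real host.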

OK, so the LUN failover works correctly, and with a little "kick start" the issue can be fixed. In the next test, after the GIVEBACK we checked the LUNs with >>esxcfg-mpath -l<< and saw that all LUNs were OK. So we looked "one level higher", at the partitions. The command >>fdisk -l | grep fb<< shows all partitions with ID "fb", which marks VMFS partitions. Here we saw that not all VMFS partitions were visible. So, in our opinion, we "lost some signals" during the GIVEBACK? That would explain why after every GIVEBACK a different VMFS datastore was inaccessible/invisible (we had about 10 LUNs with VMFS datastores on the NetApp; I posted only one of our 10 esxcfg-mpath -l output blocks above as an example).
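That visibility check can be scripted the same way: count the fb-type partitions and compare against the number of datastores the host should see. A sketch over sample fdisk output (the partition lines are invented for illustration; the ` fb ` ID column is what matters):

```shell
# Count VMFS (ID fb) partitions in fdisk output; after a giveback this
# should equal the number of VMFS datastores the host is supposed to see.
expected=2
fdisk_out=$(cat <<'EOF'
/dev/sdb1   1   249023   1999994556   fb   Unknown
/dev/sdc1   1   249023   1999994556   fb   Unknown
EOF
)
visible=$(printf '%s\n' "$fdisk_out" | grep -c ' fb ')
if [ "$visible" -ne "$expected" ]; then
    echo "only $visible of $expected VMFS partitions visible"
fi
```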

After that realization from the >>fdisk -l | grep fb<< output, we looked at all of our FC port settings. The FC port speed in the FC controller BIOS was already set to fixed; in the ESX console OS there is no software parameter to set FC speed or port type. We changed the ESX server FC ports on all four FC switches to a fixed speed of 2 Gb (we have 2 Gb FC controllers) and to F-port type only, and reproduced our TAKEOVER and GIVEBACK. Same error as before.

In the next step we set the four FC switch ports carrying the connections to the NetApp FAS3050 to a fixed 2 Gb and F-port type only, and reproduced the test -- the errors disappeared. We repeated the TAKEOVER/GIVEBACK two more times and everything stayed OK.

To double-check the change, we reverted the FC switch port settings for the four NetApp FC ports to the defaults (speed=auto; F-, N- and E-port types allowed) and the errors came back.

Finally, we applied the FC switch port settings a second time (speed=fixed 2 Gb, F-port type only) and the errors disappeared again.

In our situation, changing the port settings on the Brocade 200E FC switches fixed the issue. We run Brocade Fabric OS v5.1.0b. I know it is not the latest, but it is newer than what shipped from the factory, and it was the latest version that all vendors (server, FC controller, FC switch, storage) supported at the time, at the end of April 2007.

Sorry for the many lines -- hope this helps -- regards

Werner

Message was edited by:

wtreutz

btw:

We engaged the support of VMware, NetApp, Brocade and QLogic on our issue. It took some time to gather all the needed information and to connect and combine all the contacts and findings.

dballing
Contributor

FWIW, NetApp has agreed that "something is wonky" with them having the same WWPNs. It appears to have happened when the units were set into single image mode.

So, we'll see what comes of it.... this might be entirely a netapp issue. 😕

dballing
Contributor

OK, now get this...

NetApp is now saying, and I paraphrase here, that "Linux as a guest OS is not supported in a Clustered Failover environment, as there are certain hard-coded timeouts set, specifically, to 30 seconds which cannot easily be changed without hacks. Windows works as a guest OS because the timeouts can be altered to allow for the failover."
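For context on that "hard-coded" claim: on 2.6-kernel Linux guests the per-device SCSI command timeout is actually exposed, and writable, through sysfs. A sketch of inspecting and raising it inside a guest (the 60-second value is purely illustrative, not a vendor recommendation; whether this helps a VM ride out a head failover is exactly what's in dispute here):

```shell
# Inside a Linux guest (2.6 kernel), as root: show and raise the
# per-device SCSI command timeout, in seconds.
for t in /sys/block/sd*/device/timeout; do
    [ -e "$t" ] || continue          # skip if no SCSI disks present
    echo "$t = $(cat "$t")"          # default is commonly 30
    echo 60 > "$t"                   # illustrative value only
done
```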

He's going to send me the documentation to this effect, but has anyone heard this rubbish before?
