Solved: MD3000i Virtual Disk not on preferred path due to AVT/RDAC failover

manfriday · ‎07-24-2009

Hi,

I am having some issues with my MD3000i failing over to a non-preferred path.

The MD3000i throws the old "Virtual Disk not on preferred path due to AVT/RDAC failover" error several times a day.

It happened once in a while with ESX 3.5, but is now happening much more with version 4.

I get the following in /var/log/messages:

Jul 24 11:49:12 tdvserver1 vobd: Jul 24 11:49:12.926: 355597326227us: http://vprob.storage.redundancy.degraded Path redundancy to storage device naa.6001e4f000436ab80000067c48281b88 degraded. Path vmhba40:C2:T0:L2 is down. 2 remaining active paths. Affected datastores: "DataStore3 - MD3000i (1mb)".

Jul 24 11:49:12 tdvserver1 vobd: Jul 24 11:49:12.934: 355597333792us: http://vprob.storage.redundancy.degraded Path redundancy to storage device naa.6001e4f000438fa20000068b482818af degraded. Path vmhba40:C2:T0:L3 is down. 2 remaining active paths. Affected datastores: "DataStore4 - MD3000i (8mb)".

Jul 24 11:49:12 tdvserver1 vobd: Jul 24 11:49:12.942: 355597341892us: http://vprob.storage.redundancy.degraded Path redundancy to storage device naa.6001e4f000436ab8000017ae48d8991a degraded. Path vmhba40:C2:T0:L4 is down. 2 remaining active paths. Affected datastores: "DataStore1 - MD3000i (1mb)".

Jul 24 11:49:12 tdvserver1 vobd: Jul 24 11:49:12.949: 355597348866us: http://vprob.storage.redundancy.degraded Path redundancy to storage device naa.6001e4f000438fa20000170448d8965d degraded. Path vmhba40:C2:T0:L5 is down. 2 remaining active paths. Affected datastores: "DataStore2 - MD3000i (1mb)".

Jul 24 11:49:12 tdvserver1 vobd: Jul 24 11:49:12.995: 355597394915us: http://vprob.storage.redundancy.degraded Path redundancy to storage device naa.6001e4f000436ab800002d8149242090 degraded. Path vmhba40:C2:T0:L7 is down. 2 remaining active paths. Affected datastores: "VDIStore2 - MD3000i (1mb)".

Jul 24 11:49:13 tdvserver1 vobd: Jul 24 11:49:13.004: 355597404112us: http://vprob.storage.redundancy.degraded Path redundancy to storage device naa.6001e4f000436ab800003d4c4a2ba941 degraded. Path vmhba40:C2:T0:L8 is down. 2 remaining active paths. Affected datastores: "VDIStore1 - MD3000i (2mb)".

Jul 24 11:49:13 tdvserver1 vobd: Jul 24 11:49:13.013: 355597413405us: http://vprob.storage.redundancy.degraded Path redundancy to storage device naa.6001e4f000436ab800006a5e4a6474e4 degraded. Path vmhba40:C2:T0:L9 is down. 2 remaining active paths. Affected datastores: Unknown.

Jul 24 11:49:13 tdvserver1 vobd: Jul 24 11:49:13.020: 355597420334us: http://vprob.storage.connectivity.lost Lost connectivity to storage device naa.6001e4f000436ab800006b684a65595d. Path vmhba40:C2:T0:L10 is down. Affected datastores: "DataStore5 - MD3000i (4mb)".

Jul 24 11:49:17 tdvserver1 vobd: Jul 24 11:49:17.375: 355601775018us: http://vprob.vmfs.heartbeat.timedout 49248754-672e53f0-38b7-00151778736d VDIStore2 - MD3000i (1mb).

Jul 24 11:49:22 tdvserver1 vobd: Jul 24 11:49:22.006: 355606406026us: http://vprob.vmfs.heartbeat.recovered 49248754-672e53f0-38b7-00151778736d VDIStore2 - MD3000i (1mb).
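
A quick way to see whether events like these hit one LUN or a whole path at once is to tally them by adapter:channel:target. The following is only a rough sketch: the regular expression is written against the vobd lines exactly as pasted above, and the function name `summarize` is made up for illustration.

```python
import re
from collections import Counter

# Matches the "path is down" vobd events as they appear in the log above.
EVENT_RE = re.compile(
    r"(?P<event>vprob\.[\w.]+)\s.*?storage device (?P<device>naa\.[0-9a-f]+)"
    r".*?Path (?P<path>vmhba\d+:C\d+:T\d+:L\d+) is down"
)

def summarize(lines):
    """Count path-down events per event type and per adapter:channel:target."""
    by_event = Counter()
    by_target = Counter()
    for line in lines:
        m = EVENT_RE.search(line)
        if m is None:
            continue  # heartbeat and other lines are ignored
        by_event[m["event"]] += 1
        # Drop the ":L<n>" suffix so all LUNs behind one target group together.
        by_target[m["path"].rsplit(":", 1)[0]] += 1
    return by_event, by_target

if __name__ == "__main__":
    with open("/var/log/messages") as fh:
        events, targets = summarize(fh)
    print(events)
    print(targets)
```

In the excerpt above every path-down event is on vmhba40:C2:T0, across eight different LUNs within about 100 ms, which suggests the whole path to one target dropped rather than any single virtual disk.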

And the following in /var/log/vmkernel:

Jul 24 11:51:03 tdvserver1 vmkernel: 4:02:48:27.601 cpu7:4206)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.6001e4f000436ab800002d8149242090" - issuing command 0x41000812bc80

Jul 24 11:51:03 tdvserver1 vmkernel: 4:02:48:27.601 cpu2:6691)NMP: nmp_CompleteRetryForPath: Retry world recovered device "naa.6001e4f000436ab800002d8149242090"

Jul 24 11:51:06 tdvserver1 vmkernel: 4:02:48:30.549 cpu0:4107)NMP: nmp_CompleteCommandForPath: Command 0x12 (0x4100081c8300) to NMP device "mpx.vmhba32:C0:T0:L0" failed on physical path "vmhba32:C0:T0:L0" H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

Jul 24 11:51:06 tdvserver1 vmkernel: 4:02:48:30.549 cpu0:4107)ScsiDeviceIO: 747: Command 0x12 to device "mpx.vmhba32:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

Jul 24 11:51:06 tdvserver1 vmkernel: 4:02:48:30.693 cpu1:4097)NMP: nmp_CompleteCommandForPath: Command 0x12 (0x410008106500) to NMP device "mpx.vmhba33:C0:T0:L0" failed on physical path "vmhba33:C0:T0:L0" H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

Jul 24 11:51:06 tdvserver1 vmkernel: 4:02:48:30.693 cpu1:4097)ScsiDeviceIO: 747: Command 0x12 to device "mpx.vmhba33:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

Jul 24 11:51:13 tdvserver1 vmkernel: 4:02:48:38.059 cpu0:4107)NMP: nmp_CompleteCommandForPath: Command 0x12 (0x41000811f800) to NMP device "mpx.vmhba32:C0:T0:L0" failed on physical path "vmhba32:C0:T0:L0" H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

Jul 24 11:51:13 tdvserver1 vmkernel: 4:02:48:38.059 cpu0:4107)ScsiDeviceIO: 747: Command 0x12 to device "mpx.vmhba32:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

Jul 24 11:51:13 tdvserver1 vmkernel: 4:02:48:38.202 cpu1:4097)NMP: nmp_CompleteCommandForPath: Command 0x12 (0x4100081d2300) to NMP device "mpx.vmhba33:C0:T0:L0" failed on physical path "vmhba33:C0:T0:L0" H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

Jul 24 11:51:13 tdvserver1 vmkernel: 4:02:48:38.202 cpu1:4097)ScsiDeviceIO: 747: Command 0x12 to device "mpx.vmhba33:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
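
For reading the H:/D:/P: and sense data fields in lines like these, a small decoder can help. This is a minimal sketch, not a complete SCSI table: the lookup dictionaries hold only the standard codes that actually occur in the log above (opcode 0x12, device status 0x2, sense key 0x5, ASC/ASCQ 0x24/0x00), and the name `decode` is illustrative.

```python
import re

# Only the standard SCSI codes seen in the log above; anything else prints as hex.
SCSI_OPCODES = {0x12: "INQUIRY"}
DEVICE_STATUS = {0x0: "GOOD", 0x2: "CHECK CONDITION"}
SENSE_KEYS = {0x5: "ILLEGAL REQUEST"}
ASC_ASCQ = {(0x24, 0x00): "INVALID FIELD IN CDB"}

LINE_RE = re.compile(
    r'Command (0x[0-9a-f]+) .*?device "([^"]+)".*?'
    r"H:(0x[0-9a-f]+) D:(0x[0-9a-f]+) P:(0x[0-9a-f]+) "
    r"Valid sense data: (0x[0-9a-f]+) (0x[0-9a-f]+) (0x[0-9a-f]+)"
)

def decode(line):
    """Translate one vmkernel 'Command ... failed' line into readable fields."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    opcode, host, dev, _plugin, key, asc, ascq = (
        int(m.group(i), 16) for i in (1, 3, 4, 5, 6, 7, 8)
    )
    return {
        "device": m.group(2),
        "command": SCSI_OPCODES.get(opcode, hex(opcode)),
        "host_status": host,
        "device_status": DEVICE_STATUS.get(dev, hex(dev)),
        "sense_key": SENSE_KEYS.get(key, hex(key)),
        "additional_sense": ASC_ASCQ.get((asc, ascq), f"{asc:#04x}/{ascq:#04x}"),
    }
```

Decoded this way, the failures above are INQUIRY commands rejected with ILLEGAL REQUEST / INVALID FIELD IN CDB on mpx.vmhba32 and mpx.vmhba33, not on the naa.6001e4f... devices behind vmhba40, so they may well be unrelated to the MD3000i failovers.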

I have spoken with support folks from both Dell and VMware.

The VMware support rep said it's a Dell issue: VMware is seeing a path problem and doing what it is supposed to do.

The Dell guy says there is nothing wrong with the MD3000i and the issue is with VMware.

Neither of them seems interested in helping me out any more.

I have tried both the MRU (the default) and Round Robin path selection policies, and have configured iSCSI on the ESX hosts as per the following document:

http://www.delltechcenter.com/page/VMwareESX4.0andPowerVault+MD3000i

I can vmkping from all the hosts to all the IPs on the MD3000i with no problems.

If anyone has any insight, I would be most appreciative.

Thanks

Jason

jasonlitka · ‎07-24-2009

Is your MD3000i running Gen1 or Gen2 firmware? If the latter, make sure that your firmware is up to date as the early versions had a nasty habit of spontaneously flipping your arrays between controllers...

Jason Litka http://www.jasonlitka.com

manfriday · ‎07-24-2009

Hi Jason,

Thanks for taking the time to respond.

I am pretty positive it is Gen 2.

I updated the firmware a few weeks ago to version 07.35.22.6

I checked Dell's website, and as of a couple of days ago there was no new firmware update for the controllers.

Thanks!

Jason

jasonlitka · ‎07-24-2009

Upgrade your controllers to 07.35.22.61 and see if the problem goes away. As I understand it, that is the only Gen2 firmware version that does not do the random controller swap thing (it's also the newest).

Jason Litka http://www.jasonlitka.com

manfriday · ‎07-24-2009

Thanks Jason.

The controllers are at 07.35.22.61.

I dropped the '1' from my last post by mistake.

The controllers are on the newest version of the firmware. Thanks.

Jason

jasonlitka · ‎07-24-2009

Do you have multiple vDisks on the MD3000i? Do they always switch off one specific controller? If you've got the newest firmware then it's possible that you have a failing controller, as my MD3000i works fine with my three ESXi 4.0 hosts.

Jason Litka http://www.jasonlitka.com

manfriday · ‎07-24-2009

Right now there is a single RAID 5 disk group, with 10 virtual disks.

There isn't just one controller that consistently fails over to the other.

Sometimes Controller 1 fails over to 0, and sometimes 0 fails over to 1.

I suggested to the Dell support guy that one or both controllers might be going bad.

He said that all indications in the MD3000i event log show that there is no hardware problem, and thus they could not send out new controllers.

Jason

jasonlitka · ‎07-24-2009

Hmmm... Well, if they are flipping back and forth, that has nothing to do with VMware; it's definitely a Dell issue. Have you tried rehoming them to the correct controllers and then just rebooting the stupid thing?

Jason Litka http://www.jasonlitka.com

manfriday · ‎07-24-2009

If by rehoming you mean changing the virtual disk ownership back to its preferred path, yes.

I do that several times a day.

I have done that manually by going through the Modify panel in the MDM, as well as the "Manage Raid Controller modules" on the Support panel.

Also, I have done it via VMware by disabling the current path, letting it fail back over to the path it is supposed to be on, and then re-enabling the path in VMware.

The MD3000i has rebooted in the past week, while this issue has been plaguing me, but I have not rebooted it every time I put the disks back on their preferred path.

That would be pretty disruptive to the environment here.

Oh, and I should mention that they don't really "flip back and forth" so much. When there is a path failover, it stays failed over until I manually put it back.

Which controller fails over, however, seems random.

manfriday · ‎07-30-2009

Well, I leaned on Dell some more, and they took another look at my issue.

Looks like at least one of my controllers is flaky, so they are actually replacing both of them.

They also told me I need to rebuild the whole MD3000i. Turns out I put too many LUNs in one disk group.

Apparently the best practice is to have only 3-4 LUNs per disk group.

malaysiavm · ‎07-30-2009

If you keep getting this error, it is due to the access path on your iSCSI. Are you using the Round Robin or MRU setting for your storage path?

If the preferred path changes at the ESX level, the host will access the storage LUN through the next path, which is different from the preferred path assigned to serve that LUN on your MD3000i.

To stop this alert, you need to change the preferred path on your storage controller to match the current active path on the ESX server. If you do not do this and instead redistribute the LUNs to their preferred paths, the same error may come back in 30 to 40 minutes. This can also happen if you have turned on the Round Robin feature in ESX, as it may switch the active path on the ESX server from time to time for load balancing. That causes the active path from ESX to mismatch the preferred path on the storage controller.
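
To make the mismatch concrete, here is a tiny sketch that compares each virtual disk's preferred controller with the controller currently serving it. The function name and the example data are hypothetical; nothing here queries the array or ESX, so the two dictionaries would have to be filled in by hand from the storage manager and the host's path view.

```python
def not_on_preferred(preferred, current):
    """Return {disk: (preferred_owner, current_owner)} for mismatched disks."""
    return {
        disk: (preferred.get(disk), owner)
        for disk, owner in current.items()
        if preferred.get(disk) != owner
    }

# Hypothetical example data: controller 0 or 1 per virtual disk.
preferred = {"DataStore1": 0, "DataStore2": 1, "VDIStore1": 0}
current = {"DataStore1": 0, "DataStore2": 0, "VDIStore1": 1}

for disk, (want, have) in sorted(not_on_preferred(preferred, current).items()):
    print(f"{disk}: preferred controller {want}, currently on controller {have}")
```

Any disk this reports is one the array would flag as not on its preferred path until either the ownership or the preference is changed.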

Craig vExpert 2009 & 2010 Netapp NCIE, NCDA 8.0.1 Malaysia VMware Communities - http://www.malaysiavm.com

manfriday · ‎08-10-2009

Well, it did indeed turn out to be a Dell issue.

I had to kick them around a little, but they eventually swapped out the controllers and the mid-plane in the MD3000i, and it appears that my failover issues are resolved.

nlopezesri · ‎10-30-2009

I'm getting these same errors on my MD3000i. After they replaced the controllers, did you need to change how many LUNs you had on it after all?

AlbertWT · ‎06-03-2010

How did you go with this, mate?

Did Dell replace the hardware in your MD3000i? Mine is reporting the same problem as well.

Kind Regards,

AWT

/* Please feel free to provide any comments or input you may have. */

JohnADCO · ‎06-04-2010

The "too many LUNs in a disk group" claim is a crock. We run a bunch of these MD3000i's and you can put many, many LUNs in a disk group. This should generate no issues, and certainly no failover issues.

BradB201110141 · ‎12-17-2010

I'm dealing with multipathing issues with our ESX 4.1 and MD3000i. I've looked at the config guides, and they don't indicate that 2 VMkernel iSCSI ports are required on the host, but I can't seem to get multipathing working without them. Could someone chime in with their config for their MD3000i and whether 2 separate networks are required for multipathing, and if so, whether that means 2 VMkernel ports on the hosts?

SomeJoe7777 · ‎01-04-2011

Dell recommends two separate networks (separated logically by IP subnet, and separated physically by separate switches or VLANs).

For Dell's recommended configuration, you need 2 VMkernel interfaces.

Follow the MD3000i config document from Dell; it works exactly as it's supposed to:

http://www.delltechcenter.com/page/VMware+ESX+4.0+and+PowerVault+MD3000i
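
As a sanity check on the two-subnet layout described above, something like the following can be run against a planned addressing scheme before touching the hosts. It is only a sketch using Python's standard `ipaddress` module; the function name, interface names, and all addresses in the usage example are placeholders, not values taken from Dell's document.

```python
import ipaddress

def check_iscsi_layout(vmk_ifaces, target_ports):
    """Return a list of problems with a two-subnet iSCSI layout.

    vmk_ifaces:   {"vmk1": "192.168.130.11/24", ...}   (host side, with prefix)
    target_ports: {"ctrl0-port0": "192.168.130.101", ...}   (array side)
    """
    problems = []
    nets = {
        name: ipaddress.ip_interface(addr).network
        for name, addr in vmk_ifaces.items()
    }
    if len(nets) < 2:
        problems.append("fewer than 2 VMkernel iSCSI interfaces")
    if len(set(nets.values())) < len(nets):
        problems.append("VMkernel interfaces share a subnet")
    for port, ip in target_ports.items():
        if not any(ipaddress.ip_address(ip) in net for net in nets.values()):
            problems.append(f"{port} ({ip}) is not on any VMkernel subnet")
    return problems

# Placeholder addressing: one VMkernel port per iSCSI subnet.
print(check_iscsi_layout(
    {"vmk1": "192.168.130.11/24", "vmk2": "192.168.131.11/24"},
    {"ctrl0-port0": "192.168.130.101", "ctrl0-port1": "192.168.131.101",
     "ctrl1-port0": "192.168.130.102", "ctrl1-port1": "192.168.131.102"},
))
```

An empty list means each controller port sits on a subnet that some VMkernel interface can reach directly, and that the two interfaces are on different subnets.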
