mandg
Contributor

Dead status on some vmhba paths

What would cause 2 of the 4 hosts in a cluster to show numerous paths with a 'Dead' status (see attached)? Can I simply remedy this by disabling and re-enabling each affected path?

I suspect these paths may be the root cause of some guests taking a long time to reboot (30-45 minutes).

troberts
VMware Employee

ESX Server sets each path's state based on how the storage device responds to the SCSI TEST_UNIT_READY command (a path can also be disabled manually). The states are:

Active - The path is working and is the current path being used for transferring data.

Disabled - The path has been disabled and no data can be transferred.

Standby - The path is working but is not currently used for data transfer.

Dead - The software cannot connect to the disk through this path.

When multipathing detects a dead path, it provides failover to alternate paths.
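To make the failover behaviour concrete, here is a minimal Python sketch of how a host could pick a path from the states listed above. It assumes a simplified in-memory model: the `PathState` enum, the `Path` class and `select_path` are hypothetical illustrations, not an ESX API, and the real ESX path-selection logic also depends on the multipathing policy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class PathState(Enum):
    ACTIVE = "active"      # working and currently carrying I/O
    STANDBY = "standby"    # working but not currently used
    DISABLED = "disabled"  # turned off; no data is transferred
    DEAD = "dead"          # the host cannot reach the disk this way


@dataclass
class Path:
    name: str  # hypothetical runtime name, e.g. "vmhba1:0:1"
    state: PathState


def select_path(paths: List[Path]) -> Optional[Path]:
    """Return the path to use for I/O: the active one if present,
    otherwise promote the first standby path (failover)."""
    for p in paths:
        if p.state is PathState.ACTIVE:
            return p
    for p in paths:
        if p.state is PathState.STANDBY:
            p.state = PathState.ACTIVE  # failover: standby becomes active
            return p
    return None  # only dead/disabled paths left: the LUN is unreachable


paths = [
    Path("vmhba1:0:1", PathState.DEAD),
    Path("vmhba2:0:1", PathState.STANDBY),
]
chosen = select_path(paths)
print(chosen.name if chosen else "no usable path")
```

Dead and disabled paths are simply skipped, which is why a host with dead paths can keep running as long as one working path per LUN remains.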

You can try rescanning to see if the paths come back. If not, you most likely have a zoning or LUN presentation problem.

mike_laspina
Champion

Hello,

Is your storage system using active/active SPs? If not, use the MRU (Most Recently Used) multipathing policy rather than Fixed.

http://blog.laspina.ca/ vExpert 2009
mandg
Contributor

Thanks for the ideas. I've performed a rescan, but the paths remained dead after it completed.

I'm not really strong on our SAN environment. In layman's terms, does SP refer to storage processors? And I presume the multipathing policy is something that's configured on the SAN switches (Brocades in my case)?

mcowger
Immortal

SP is the EMC term for controllers, and yes, it stands for storage processor.

Multipathing is configured on the ESX side, not on the switches.

You need to make sure you have all LUNs presented to all your HBAs.

--Matt VCDX #52 blog.cowger.us
mike_laspina
Champion

My apologies, I should have explained why you may need to change the configuration. SAN storage systems have one or more storage processors (SPs). On systems with two SPs configured active/active, either processor can provide access to a LUN, so ESX can simply use whichever path is active. Other systems are configured with active/inactive SPs: to use the other path, the array must transfer control of the LUN to the inactive SP, which is much slower. If the multipathing policy is set to Fixed on such a system, ESX can get into a bad state called thrashing: the ESX failover moves to the other available path faster than the SPs can transfer control to it, so control can toggle back and forth.
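The thrashing effect can be illustrated with a toy Python simulation. It is a deliberate simplification, not the actual ESX algorithm: it assumes an active/inactive array with two paths, a preferred path A through SP-A that flaps up and down, and a path B through SP-B that is always up. `count_trespasses` is a hypothetical helper that counts how often the array has to move LUN ownership between SPs under each policy.

```python
from typing import List


def count_trespasses(policy: str, preferred_up: List[bool]) -> int:
    """Count LUN ownership moves between SPs on an active/inactive array.

    preferred_up[i] says whether the preferred path A (via SP-A) is up at
    step i; path B (via SP-B) is assumed to be always up.
    """
    current = "A"  # path in use, and therefore the SP that owns the LUN
    moves = 0
    for a_up in preferred_up:
        if policy == "fixed":
            # Fixed: always return to the preferred path whenever it is up.
            wanted = "A" if a_up else "B"
        elif policy == "mru":
            # MRU: stay on the current path unless that path has failed.
            wanted = current if (current == "B" or a_up) else "B"
        else:
            raise ValueError(f"unknown policy: {policy}")
        if wanted != current:
            moves += 1  # the array must transfer the LUN to the other SP
            current = wanted
    return moves


flapping = [True, False, True, False, True, False]
print("fixed:", count_trespasses("fixed", flapping))
print("mru:  ", count_trespasses("mru", flapping))
```

With this flapping pattern, Fixed moves ownership on every change (5 moves), while MRU fails over once and stays on path B (1 move), which is why MRU is the safer choice on arrays that are not active/active.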

As mcowger has indicated, you need to make sure your host can see each LUN correctly on both paths. If your SAN system is active/active, you can leave the policy as Fixed; otherwise, change it to MRU before correcting the path issues.
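Checking that every LUN is visible through every HBA is a simple set comparison. The following Python sketch assumes you have already collected, by hand or from the tools' output, a list of live (non-dead) paths as (HBA, LUN) pairs; `luns_missing_per_hba` and the sample data are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def luns_missing_per_hba(
    hbas: Iterable[str], live_paths: Iterable[Tuple[str, int]]
) -> Dict[str, Set[int]]:
    """For each HBA, return the LUNs that some HBA can see but it cannot.

    HBAs are passed explicitly so that an HBA with no live paths at all
    is still reported.
    """
    seen: Dict[str, Set[int]] = {hba: set() for hba in hbas}
    for hba, lun in live_paths:
        seen.setdefault(hba, set()).add(lun)
    all_luns: Set[int] = set().union(*seen.values())
    missing = {hba: all_luns - luns for hba, luns in seen.items()}
    return {hba: gap for hba, gap in missing.items() if gap}


# Hypothetical sample: LUN 2 is only reachable through vmhba1.
hbas = ["vmhba1", "vmhba2"]
live = [("vmhba1", 1), ("vmhba1", 2), ("vmhba2", 1)]
print(luns_missing_per_hba(hbas, live))
```

An empty result means every LUN is presented on every HBA; any entry points at a zoning or LUN presentation gap for that adapter.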

If you post the output of the following commands, we can have a look:

esxcfg-info -s

and

esxcfg-mpath -l
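Once you have the `esxcfg-mpath -l` listing, a quick tally of dead paths per HBA shows whether the problem is confined to one adapter or fabric. This Python sketch assumes a simplified line format in which each path line contains its runtime name (`vmhbaN:T:L`) followed by a state word (On, Off, Standby or Dead); the exact output differs between ESX versions, so the regular expression and the made-up `SAMPLE` text are assumptions to adjust against your real output.

```python
import re
from collections import Counter
from typing import Dict

# Assumed format: runtime name (vmhbaN:T:L), whitespace, then a state word.
PATH_RE = re.compile(
    r"\b(vmhba\d+):\d+:\d+\s+(on|off|standby|dead)\b", re.IGNORECASE
)


def dead_paths_per_hba(listing: str) -> Dict[str, int]:
    """Count paths reported as Dead, grouped by HBA."""
    counts = Counter()
    for line in listing.splitlines():
        m = PATH_RE.search(line)
        if m and m.group(2).lower() == "dead":
            counts[m.group(1)] += 1
    return dict(counts)


# Made-up sample in the assumed format (header lines do not match the regex).
SAMPLE = """\
Disk vmhba1:0:1 has 2 paths and policy of Fixed
 vmhba1:0:1 On active preferred
 vmhba2:0:1 Dead
Disk vmhba1:0:2 has 2 paths and policy of Fixed
 vmhba1:0:2 On active preferred
 vmhba2:0:2 Dead
"""
print(dead_paths_per_hba(SAMPLE))
```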

What SAN system and model are you running?

mandg
Contributor

Thanks for the thorough explanation. I've attached the esxcfg output from one of the affected hosts. The SAN is an HP EVA 8000.

mike_laspina
Champion

The HP EVA 8000 has active/active SPs.

There is an issue in your path configuration, but it is hard to tell where. I see 4 VMFS volumes across 4 WWPNs over 2 local FC adapter paths, but there are 4 additional WWPNs that are not mapped to any VMFS volume.

There seem to be too many LUNs masked to this ESX host for what it is actually using.
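The cross-check described above (which target WWPNs actually lead to a VMFS volume) can be sketched in a few lines of Python. The inputs are assumptions: a hypothetical mapping from each target WWPN to the set of LUNs seen through it, and the set of LUNs known to hold VMFS volumes. `unmapped_targets` and the placeholder target names are illustrative only.

```python
from typing import Dict, List, Set


def unmapped_targets(
    luns_by_target: Dict[str, Set[int]], vmfs_luns: Set[int]
) -> List[str]:
    """Return target WWPNs through which the host sees no VMFS LUN."""
    return sorted(
        target for target, luns in luns_by_target.items()
        if not luns & vmfs_luns  # no overlap with any VMFS-backed LUN
    )


# Placeholder data: two targets carry the VMFS LUNs, two do not.
targets = {
    "target-A1": {1, 2},
    "target-A2": {3, 4},
    "target-B1": {7},
    "target-B2": set(),
}
vmfs = {1, 2, 3, 4}
print(unmapped_targets(targets, vmfs))
```

Targets returned here are the ones worth comparing against the array's LUN presentation, since they are exactly the extra paths that carry nothing the host uses.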

What output did you get from

esxcfg-mpath -l

mandg
Contributor

Thanks Mike. You saved me several hours of verifying the active/active state myself.

esxcfg-mpath output attached.

mike_laspina
Champion

I need to know what your SAN architecture looks like.

Is this how your EVA 8000 is wired?

Can you take a screenshot of the LUN configuration on the HP EVA and post it?

I cannot be certain without the LUN masking table from the HP EVA.

mandg
Contributor

Thanks Mike. Yes, I finally verified that the physical layout is the same as your diagram.

When you ask for a screenshot of the LUN config on the EVA, do you mean from the FC switches (Brocades) or the EVA itself?

mike_laspina
Champion

I mean the EVA itself; it will help to see which disks are mapped to which hosts.

The extra paths should not be there. I have the same architecture using a DS4400 and Brocade switches, and I do not have any dead paths.

It could be the access LUN; is that still exposed to the hosts? It's been three years since I worked on the EVA, so the grey matter is a bit fuzzy on it.

Funtoosh
Enthusiast

I know Mike is doing what is supposed to be done, but I would also check:

a) Whether this LUN is a shared LUN.

b) Whether other hosts can see this LUN. If they can, reload the HBAs (run "reload-hbas") and then rescan them.

c) I would also contact the SAN admin to verify how the LUN has been presented to the host.

d) I would also check what mode the LUN is set to on the SAN side. On Hitachi arrays it is called "vmware" (not sure; I did this a long time ago).
