NuggetGTR
VMware Employee

Failing Storage paths

Hi all,

I have an issue that's been bothering me and I'm hoping someone here can guide me in the right direction. I'm currently waiting for VMware support to get back to me, but I thought I would hop on here and see if people have some ideas.

I'm seeing a lot of storage path failures in general, but when trying to fail over MSCS resources from one node to another (which ends up failing, or taking a very, very long time) I see the same type of path failures, just an excessive number of them. Below is a snippet from the kernel logs (all of these point to the RDM LUNs for the MSCS disk resources):

Jan 21 11:27:14 <ESX HOST NAME> vmkernel: 45:19:33:02.296 cpu5:4358)NMP: nmp_CompleteCommandForPath: Command 0x0 (0x410001028a40) to NMP device "naa.60060160f192280058a7f865710de011" failed on physical path "vmhba1:C0:T1:L96" H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0.
Jan 21 11:27:14 <ESX HOST NAME> vmkernel: 45:19:33:02.296 cpu5:4358)ScsiDeviceIO: 747: Command 0x0 to device "naa.60060160f192280058a7f865710de011" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0.
Jan 21 11:27:14 <ESX HOST NAME> vmkernel: 45:19:33:02.299 cpu5:4332)NMP: nmp_CompleteCommandForPath: Command 0x0 (0x4100010954c0) to NMP device "naa.60060160f192280082c5674f710de011" failed on physical path "vmhba1:C0:T1:L90" H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0.
Jan 21 11:27:14 <ESX HOST NAME> vmkernel: 45:19:33:02.299 cpu5:4332)ScsiDeviceIO: 747: Command 0x0 to device "naa.60060160f192280082c5674f710de011" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0.
Jan 21 11:27:14 <ESX HOST NAME> vmkernel: 45:19:33:02.300 cpu5:4332)NMP: nmp_CompleteCommandForPath: Command 0x0 (0x410001169e00) to NMP device "naa.60060160f192280083c5674f710de011" failed on physical path "vmhba1:C0:T1:L91" H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x3f 0xe.
Jan 21 11:27:14 <ESX HOST NAME> vmkernel: 45:19:33:02.300 cpu5:4332)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "naa.60060160f192280083c5674f710de011" state in doubt; requested fast path state update...
Jan 21 11:27:14 <ESX HOST NAME> vmkernel: 45:19:33:02.300 cpu5:4332)ScsiDeviceIO: 747: Command 0x0 to device "naa.60060160f192280083c5674f710de011" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x3f 0xe.
Jan 21 11:27:14 <ESX HOST NAME> vmkernel: 45:19:33:02.303 cpu5:4970)NMP: nmp_CompleteCommandForPath: Command 0x0 (0x4100011a07c0) to NMP device "naa.60060160f1922800f64c4756710de011" failed on physical path "vmhba1:C0:T1:L92" H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0.
Jan 21 11:27:14 <ESX HOST NAME> vmkernel: 45:19:33:02.303 cpu5:4970)ScsiDeviceIO: 747: Command 0x0 to device "naa.60060160f1922800f64c4756710de011" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0.
Jan 21 11:27:14 <ESX HOST NAME> vmkernel: 45:19:33:02.306 cpu5:4358)NMP: nmp_CompleteCommandForPath: Command 0x0 (0x410001166000) to NMP device "naa.60060160f1922800f74c4756710de011" failed on physical path "vmhba1:C0:T1:L93" H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0.
Jan 21 11:27:14 <ESX HOST NAME> vmkernel: 45:19:33:02.306 cpu5:4358)ScsiDeviceIO: 747: Command 0x0 to device "naa.60060160f1922800f74c4756710de011" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0.
Jan 21 11:27:44 <ESX HOST NAME> vmkernel: 45:19:33:32.197 cpu5:4431)NMP: nmp_CompleteCommandForPath: Command 0x3c (0x4100010a8480) to NMP device "naa.60060160f192280082c5674f710de011" failed on physical path "vmhba1:C0:T1:L90" H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Jan 21 11:27:44 <ESX HOST NAME> vmkernel: 45:19:33:32.197 cpu5:4431)ScsiDeviceIO: 747: Command 0x3c to device "naa.60060160f192280082c5674f710de011" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Jan 21 11:27:44 <ESX HOST NAME> vmkernel: 45:19:33:32.198 cpu5:64257)NMP: nmp_CompleteCommandForPath: Command 0x3c (0x4100010923c0) to NMP device "naa.60060160f192280082c5674f710de011" failed on physical path "vmhba1:C0:T1:L90" H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Jan 21 11:27:44 <ESX HOST NAME> vmkernel: 45:19:33:32.198 cpu5:64257)ScsiDeviceIO: 747: Command 0x3c to device "naa.60060160f192280082c5674f710de011" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Jan 21 11:27:44 <ESX HOST NAME> vmkernel: 45:19:33:32.264 cpu5:29452)NMP: nmp_CompleteCommandForPath: Command 0x1a (0x410001124280) to NMP device "naa.60060160f192280058a7f865710de011" failed on physical path "vmhba1:C0:T1:L96" H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Jan 21 11:27:44 <ESX HOST NAME> vmkernel: 45:19:33:32.264 cpu5:29452)ScsiDeviceIO: 747: Command 0x1a to device "naa.60060160f192280058a7f865710de011" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Jan 21 11:27:44 <ESX HOST NAME> vmkernel: 45:19:33:32.277 cpu5:4359)NMP: nmp_CompleteCommandForPath: Command 0x1a (0x410001092dc0) to NMP device "naa.60060160f192280082c5674f710de011" failed on physical path "vmhba1:C0:T1:L90" H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Jan 21 11:27:44 <ESX HOST NAME> vmkernel: 45:19:33:32.277 cpu5:4359)ScsiDeviceIO: 747: Command 0x1a to device "naa.60060160f192280082c5674f710de011" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Jan 21 11:27:44 <ESX HOST NAME> vmkernel: 45:19:33:32.289 cpu5:5882)NMP: nmp_CompleteCommandForPath: Command 0x1a (0x4100011a24c0) to NMP device "naa.60060160f192280083c5674f710de011" failed on physical path "vmhba1:C0:T1:L91" H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Jan 21 11:27:44 <ESX HOST NAME> vmkernel: 45:19:33:32.289 cpu5:5882)ScsiDeviceIO: 747: Command 0x1a to device "naa.60060160f192280083c5674f710de011" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Jan 21 11:27:44 <ESX HOST NAME> vmkernel: 45:19:33:32.294 cpu5:14017)NMP: nmp_CompleteCommandForPath: Command 0x1a (0x410001272100) to NMP device "naa.60060160f1922800f64c4756710de011" failed on physical path "vmhba1:C0:T1:L92" H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Jan 21 11:27:44 <ESX HOST NAME> vmkernel: 45:19:33:32.294 cpu5:14017)ScsiDeviceIO: 747: Command 0x1a to device "naa.60060160f1922800f64c4756710de011" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Jan 21 11:27:44 <ESX HOST NAME> vmkernel: 45:19:33:32.298 cpu5:29453)NMP: nmp_CompleteCommandForPath: Command 0x1a (0x410001120040) to NMP device "naa.60060160f1922800f74c4756710de011" failed on physical path "vmhba1:C0:T1:L93" H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Jan 21 11:27:44 <ESX HOST NAME> vmkernel: 45:19:33:32.298 cpu5:29453)ScsiDeviceIO: 747: Command 0x1a to device "naa.60060160f1922800f74c4756710de011" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Jan 21 11:27:44 <ESX HOST NAME> vmkernel: 45:19:33:32.314 cpu2:64251)NMP: nmp_CompleteCommandForPath: Command 0x1a (0x410001271200) to NMP device "naa.60060160f1922800f84c4756710de011" failed on physical path "vmhba2:C0:T1:L94" H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Jan 21 11:27:44 <ESX HOST NAME> vmkernel: 45:19:33:32.314 cpu2:64251)ScsiDeviceIO: 747: Command 0x1a to device "naa.60060160f1922800f84c4756710de011" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Jan 21 11:27:44 <ESX HOST NAME> vmkernel: 45:19:33:32.347 cpu2:71084)NMP: nmp_CompleteCommandForPath: Command 0x1a (0x410001120840) to NMP device "naa.60060160f1922800ab79d55f710de011" failed on physical path "vmhba2:C0:T1:L95" H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Jan 21 11:27:44 <ESX HOST NAME> vmkernel: 45:19:33:32.347 cpu2:71084)ScsiDeviceIO: 747: Command 0x1a to device "naa.60060160f1922800ab79d55f710de011" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
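
For reference, the H:/D:/P: fields and the sense triple in these lines decode with the standard SCSI codes: H:0x0 means the HBA itself reported no transport error, D:0x2 is a CHECK CONDITION returned by the array, and the sense data 0x6 0x29 0x0 is a unit attention "power on, reset, or bus device reset occurred", 0x6 0x3f 0xe is "reported LUNs data has changed", and 0x5 0x24 0x0 is an illegal request "invalid field in CDB". Here's a rough little script I've been using to tally these from a saved copy of the vmkernel log; it's only a sketch, and the log file name and the exact line format are assumptions based on the snippet above.

#!/usr/bin/env python
# Rough decoder/tally for the vmkernel NMP/ScsiDeviceIO lines above.
# Counts only the ScsiDeviceIO lines so each failed command is counted once
# (every failure also logs a matching nmp_CompleteCommandForPath line).
import re
import sys
from collections import Counter

# Standard SCSI sense keys and the ASC/ASCQ pairs seen in the snippet.
SENSE_KEYS = {0x5: "ILLEGAL REQUEST", 0x6: "UNIT ATTENTION"}
ASC_ASCQ = {
    (0x29, 0x00): "power on, reset, or bus device reset occurred",
    (0x3F, 0x0E): "reported LUNs data has changed",
    (0x24, 0x00): "invalid field in CDB",
}

LINE_RE = re.compile(
    r'device "(?P<naa>naa\.[0-9a-f]+)".*?'
    r'H:(?P<h>0x[0-9a-f]+) D:(?P<d>0x[0-9a-f]+) P:(?P<p>0x[0-9a-f]+) '
    r'Valid sense data: (?P<sk>0x[0-9a-f]+) (?P<asc>0x[0-9a-f]+) (?P<ascq>0x[0-9a-f]+)'
)

def decode(log_path="vmkernel.log"):
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            if "ScsiDeviceIO" not in line:
                continue
            m = LINE_RE.search(line)
            if not m:
                continue
            sk, asc, ascq = (int(m.group(k), 16) for k in ("sk", "asc", "ascq"))
            counts[(m.group("naa"),
                    SENSE_KEYS.get(sk, hex(sk)),
                    ASC_ASCQ.get((asc, ascq), "%#x/%#x" % (asc, ascq)))] += 1
    for (naa, skey, reason), n in counts.most_common():
        print("%5d  %s  %s: %s" % (n, naa, skey, reason))

if __name__ == "__main__":
    decode(sys.argv[1] if len(sys.argv) > 1 else "vmkernel.log")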

Now, this is a fairly new cluster running ESX 4 Update 1. I'm sure I have an issue at the storage end: when these hosts were built, RecoverPoint and a SANTap device were implemented for all the paths to go through, and this has caused us issues with dropping storage in the past. Has anyone got experience using these devices with ESX? I've heard VMware doesn't recommend hosting ESX behind RecoverPoint?

I might add that when both nodes of the MSCS cluster are on the same host, everything works, there are no errors, and failovers are flawless; it's only when they are on different hosts that I hit this issue. But as mentioned, since the day these hosts were built I've been seeing the above path failures riddled through the kernel logs for the LUNs that hold the datastores for the virtual machines, and these path failures occur on every host too (the only thing the hosts have in common, being split across different enclosures, is the SAN paths). The snippet above is from initiating a failover of an MSCS node, where I see path failures for every RDM in the resource.

I should also add that the virtual machines have been set up to the letter for MSCS, so it's not an issue at that end. And as mentioned before, I get these path failures for all LUNs on all hosts, but normally they're scattered through the kernel logs like a cancer; it's only when doing an MSCS resource group failover that I get a big block of them.
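
To show what I mean by "scattered through the logs" versus a big block at failover time, here's a quick sketch that buckets the failed-command lines per minute from a saved copy of the vmkernel log. It's only illustrative; the log file name and the "Jan 21 11:27:14" timestamp format are assumptions based on the snippet above.

#!/usr/bin/env python
# Buckets nmp_CompleteCommandForPath failures per minute, to show whether
# they trickle in constantly or spike around an MSCS failover window.
import re
import sys
from collections import Counter

# "Jan 21 11:27:14 ..." -> keep "Jan 21 11:27" as the bucket key.
STAMP_RE = re.compile(r'^(\w{3}\s+\d+ \d{2}:\d{2}):\d{2}')

def per_minute(log_path="vmkernel.log"):
    buckets = Counter()
    with open(log_path) as log:
        for line in log:
            if "nmp_CompleteCommandForPath" not in line:
                continue
            stamp = STAMP_RE.match(line)
            if stamp:
                buckets[stamp.group(1)] += 1
    # Lexical sort is fine for a single day's worth of log.
    for minute in sorted(buckets):
        print("%s  %4d failed commands" % (minute, buckets[minute]))

if __name__ == "__main__":
    per_minute(sys.argv[1] if len(sys.argv) > 1 else "vmkernel.log")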

Any ideas? Is it an issue at the storage end? If so, RecoverPoint?

Hope I explained that OK... it comes across clearly in my head, hahaha.

Cheers

________________________________________ Blog: http://virtualiseme.net.au VCDX #201 Author of Mastering vRealize Operations Manager
ThompsG
Virtuoso

Hi,

Just wondering if you wouldn't mind sharing your SAN layout with us? Basically I'm after the number of fabrics and the number of switches in each fabric.

Thanks and kind regards.

NuggetGTR
VMware Employee

To be honest, I'm not 100% sure of the layout. I'm just the "VMware guy"; I know a little about storage, but it's not really my area. In fact, I highly doubt our storage team knows the layout 100% either... not joking.

We use HP kit running Flex-10. There are two FC modules per enclosure and each module has four uplinks: all four uplinks of module A connect to fabric switch A, and all four uplinks of module B connect to fabric switch B. I couldn't really tell you what happens after that.

In Virtual Connect Manager there is SAN A (module A) and SAN B (module B); these represent vmhba1 and vmhba2 on all the blades. The cluster I'm talking about is spread over four enclosures, all with the same setup.

VMware support thinks it's something at the storage end and I tend to agree, but there is always a possibility it could be the Flex-10 and Virtual Connect, or something like that.
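
For what it's worth, a rough way to check whether the failures are confined to one fabric (SAN A = vmhba1, SAN B = vmhba2 as above) is to tally the failed-path lines per HBA; the snippet in my first post already shows both vmhba1 and vmhba2 paths failing, so it doesn't look confined to a single module. A minimal sketch, assuming a saved copy of the vmkernel log (the file name is just an example):

#!/usr/bin/env python
# Tallies failed-path events per HBA (and per HBA/LUN) so you can see
# whether the errors stay on one fabric or hit both SAN A and SAN B.
import re
import sys
from collections import Counter

PATH_RE = re.compile(r'physical path "(vmhba\d+):C\d+:T\d+:(L\d+)"')

def tally(log_path="vmkernel.log"):
    per_hba = Counter()
    per_lun = Counter()
    with open(log_path) as log:
        for line in log:
            if "nmp_CompleteCommandForPath" not in line:
                continue
            m = PATH_RE.search(line)
            if m:
                hba, lun = m.groups()
                per_hba[hba] += 1
                per_lun[(hba, lun)] += 1
    for hba, n in per_hba.most_common():
        print("%s: %d failed commands" % (hba, n))
        for (h, lun), c in per_lun.most_common():
            if h == hba:
                print("  %s: %d" % (lun, c))

if __name__ == "__main__":
    tally(sys.argv[1] if len(sys.argv) > 1 else "vmkernel.log")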

I know the errors I'm seeing for the failed paths mean that ESX has tried to communicate with the LUN but has not received a reply, and then asks for a fast path state update. It sort of fits with my RecoverPoint/SANTap theory, as that sits in the middle: both ESX/Virtual Connect think storage is up and the SAN and switches think storage is up, but the man in the middle isn't passing things through correctly. But as I said, I'm not really a storage guru.

Sorry if that didn't really help.

________________________________________ Blog: http://virtualiseme.net.au VCDX #201 Author of Mastering vRealize Operations Manager
ThompsG
Virtuoso

Hi there,

Well, you have confirmed that it is not the same issue we had with faulty ISLs between switches, so that's one less thing to worry about ;)

We run HP blades here as well; however, we run pass-thru modules rather than Flex-10. There was an issue with the HP 4Gb Fibre Channel Pass-thru Modules and high I/O that caused problems, which was fixed by a firmware update on the modules. That update needed to be done after the Onboard Administrator (OA) firmware was upgraded. Have you updated the firmware on the OA and Virtual Connect?

Sorry if this is covering ground that has already been covered, but you know what they say about assumptions...

Kind regards.

NuggetGTR
VMware Employee

Hey cheers for the help.

The OA and VC are up to date with the latest firmware, as of November/December last year.

We were having issues with FC modules dropping, which was due to a firmware mismatch, but things have been stable since the updates.

________________________________________ Blog: http://virtualiseme.net.au VCDX #201 Author of Mastering vRealize Operations Manager