VMware Cloud Community
rayvd
Enthusiast

Fiber Channel -- failed H:0x5 D:0x0 P:0x0

Hi everyone... running into a pretty baffling problem as we're looking to establish FC connectivity from our ESXi 5.x hosts to an IBM Gen2 XIV.

We're deploying this at two separate sites with identical hardware -- save that one site is running ESXi 5.0 and the other site is running ESXi 5.1.  Both configurations consist of a number of Dell blades with QLogic HBAs talking to an in-bladecenter Brocade switch, which is ISL'd up to our fiber "core" switch environment off which the XIV hangs directly.  We're using WWPN-based zoning and two fabrics (two core FC switches, plus two FC switches in each blade center).

Our problem is that at the site running ESXi 5.1, HBA rescans on the ESXi hosts take a *long* time -- when we had only a few LUNs, rescans were taking ~5 minutes... now that we have 20 or so LUNs exposed, we're up to 35-40 minutes for a rescan to complete.  This also makes host reboots take an excessive amount of time (which makes sense).  At the site running ESXi 5.0, everything is "normal" -- rescans complete in under a minute.

The following errors can be observed in vmkernel.log throughout the rescan (and actually at any time through the day, though they seem to be more frequent during rescan activity):

2013-03-29T20:24:32.112Z cpu6:8198)ScsiDeviceIO: 2329: Cmd(0x4124003e56c0) 0x1a, CmdSN 0x73f5a from world 0 to dev "eui.001738000f86088c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
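For reference, the failed-command triplet decodes as the host/device/plug-in status bytes, so H:0x5 is a host (HBA driver) status.  A quick sketch of pulling that byte out of a log line and mapping it against the standard ESXi host-status codes (only a few values are covered below; the log line is the one above):

```shell
# Decode the H: (host) status byte from a vmkernel.log ScsiDeviceIO line.
# 0x5 = ABORT: the HBA driver aborted the command.
line='2013-03-29T20:24:32.112Z cpu6:8198)ScsiDeviceIO: 2329: Cmd(0x4124003e56c0) 0x1a, CmdSN 0x73f5a from world 0 to dev "eui.001738000f86088c" failed H:0x5 D:0x0 P:0x0'

host=$(printf '%s\n' "$line" | sed -n 's/.*H:\(0x[0-9a-f]*\).*/\1/p')
case "$host" in
  0x0) echo "H:$host = OK" ;;
  0x1) echo "H:$host = NO_CONNECT" ;;
  0x3) echo "H:$host = TIMEOUT" ;;
  0x5) echo "H:$host = ABORT (driver aborted the command)" ;;
  *)   echo "H:$host = (see the VMware host-status code table)" ;;
esac
```

The 0x1a after Cmd(...) is the SCSI opcode -- MODE SENSE(6) -- i.e. these are inquiry-type commands being aborted, not guest I/O.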

We have tickets open with both VMware (who pointed us to our storage vendor, IBM) and IBM.  IBM made a few suggestions (nothing major) which we implemented but didn't help.  From their perspective, our SAN environment looks fine -- no errors on ports, zoning correct, firmware up to date, etc.  We're continuing to work the support angle, but wanted to throw the issue out here in case anyone has any suggestions.

At this point, our next steps are more along the lines of divide and conquer / trial and error.  With the newer version of ESXi there's also newer firmware on the QLogic HBAs than at the site running ESXi 5.0 (where there are no problems).  We may try downgrading one of our 5.1 hosts to 5.0 to see if the problem follows.  After that we'll try to reproduce the issue from an OS other than ESXi, and then perhaps from a standalone host attached directly to the BC switch.  And so on.  Unfortunately, this is all PRD gear, so everything has to be scheduled and it'll take a while to get through all of the trial & error.

Does anything here jump out to anyone that could help us jumpstart solving the problem?

I'll note that in watching the logs, *some* LUNs seem to throw errors more frequently than others (and the number of errors is pretty consistent across each group).  I thought this perhaps had to do with some of the LUNs being detected as supporting "Hardware Acceleration" and others not (which is also baffling since these are all LUNs on the same XIV -- why wouldn't they all support or not support HW Acceleration?).

Thanks in advance!

Ray



5 Replies
a_p_
Leadership

Just a thought: are any of these LUNs used as RDMs for, e.g., Microsoft Cluster VMs?

André

rayvd
Enthusiast

All of the LUNs are attached as RDMs to MS SQL servers, but only one or two of them are used in a shared-disk SQL cluster.

a_p_
Leadership

With only two cluster LUNs I'm not sure whether this is related, but it's at least worth reading:

http://blogs.vmware.com/vsphere/2013/03/esxi-slow-boot-with-mscs.html

André

rayvd
Enthusiast

Wow.  That was it.  We actually had a lot more MSCS LUNs than I thought.  It was a PITA to pick them all out and change the flag, but once done, rescans now complete in seconds.
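For anyone hitting the same thing: per the linked post, the flag in question is the perennially-reserved setting on each RDM LUN that's part of an MSCS cluster.  A sketch of setting it from the host CLI -- the naa. device ID below is a placeholder, substitute your own from `esxcli storage core device list`:

```shell
# Mark an RDM LUN used by an MSCS node as perennially reserved, so the host
# stops waiting on the cluster's SCSI reservation during rescans and boots.
# naa.xxxxxxxxxxxxxxxx is a placeholder device ID.
esxcli storage core device setconfig -d naa.xxxxxxxxxxxxxxxx --perennially-reserved=true

# Confirm the flag took effect:
esxcli storage core device list -d naa.xxxxxxxxxxxxxxxx | grep -i "perennially reserved"
```

Note the setting is per-host, so it has to be applied on every host that sees the LUNs.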

How VMware Support/IBM missed this, I have no idea!

Many many thanks.

a_p_
Leadership

Glad to hear this solved the issue.

> How VMware Support/IBM missed this, I have no idea!

Well, I guess I only thought of this because I was lucky (was it luck???) enough to troubleshoot and solve the same issue for a customer a few weeks ago.

André
