Re: Erratic MultiPathing Behaviour on ESX 3.5 host

levinton2011101 · ‎11-17-2011

Hi guys,

I'm almost clueless as to what's happening here. We have several ESX 3.5 clusters facing this issue across our virtual datacenter:

Only 1 host in a 6-node ESX 3.5 cluster has a couple of LUNs that are missing some paths or showing a couple of paths as Dead. Our configuration is as follows:

The host has 2 single-port Emulex HBAs, each going to a different fabric (HBA 1 goes to Fabric A, HBA 2 goes to Fabric B), zoned with 4 ports (each on a different node) on a 3PAR T800 array.

This is actually really confusing as:

A) All other LUNs (which come from the same storage and the same storage ports) have all their paths as they should. Our standard is 4 paths with the Fixed path policy for ESX 3.5 hosts.

B) The same LUN on other hosts in the same cluster has all its paths as it should (that is, no missing or dead paths at all :S)

C) On the host where dead/missing paths are seen, 1 path is missing from each HBA. That means the LUN still has 1 path online going through each HBA, so that actually rules out any connectivity issue.
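For anyone who wants to check the "1 path missing per HBA" pattern across many LUNs, here is a minimal Python sketch. It assumes the path list has already been collected by hand into (path name, state) pairs; the sample paths, the state labels "on" and "dead", and the adapter names are hypothetical and are not taken from this host.

```python
# Assumption: the standard is 4 paths per LUN, spread evenly over 2 HBAs.
EXPECTED_PER_HBA = 2


def live_paths_per_hba(paths, hbas):
    """Count live paths per HBA.

    paths: iterable of (path_name, state), e.g. ("vmhba0:0:11", "on").
    hbas:  adapter names that should carry paths, so an HBA with no
           listed paths at all still shows up with a count of 0.
    """
    live = {hba: 0 for hba in hbas}
    for name, state in paths:
        hba = name.split(":")[0]  # adapter part of adapter:target:LUN
        if state == "on":
            live[hba] = live.get(hba, 0) + 1
    return live


def degraded_hbas(paths, hbas, expected=EXPECTED_PER_HBA):
    """Return the HBAs that have fewer live paths than expected."""
    counts = live_paths_per_hba(paths, hbas)
    return {hba: n for hba, n in counts.items() if n < expected}


# Hypothetical LUN: one path dead on the first HBA, one path not
# listed at all on the second HBA.
sample_paths = [
    ("vmhba0:0:11", "on"),
    ("vmhba0:1:11", "dead"),
    ("vmhba1:0:11", "on"),
]
result = degraded_hbas(sample_paths, hbas=["vmhba0", "vmhba1"])
print(result)
```

A LUN with one live path on each HBA, as described in C), is reported as degraded on both adapters even though I/O still flows.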

I have already verified our SAN switches, cores, everything, and all I can think of is some kind of bug within ESX.

Any ideas/pointers???

Best,

Jon

mcowger · ‎11-17-2011

Rebooted the host?

Check your VLUN configuration on the 3PAR side?

--Matt VCDX #52 blog.cowger.us

levinton2011101 · ‎11-17-2011

Already checked that.

I think my next action is to enable lpfc driver verbose logging as well as SCSI mid-layer logging:

http://www.emulex.com/knowledge/search/viewArticle.jsp?docId=29

Best,

Jon

PacketRacer · ‎11-17-2011

That's a fascinating one!

A couple of things to try (you may have tried them already):

1) If you present a test LUN to your hosts, do you get the same results?

2) If you're OK with the risk, unplug one of the host's HBAs from the switch and plug it back in (or disable/enable the port). This will force it to log in to the fabric again. That may either fix the problem, or... you might discover that the other LUNs on that host actually have the same problem but, because you're using the Fixed policy, those paths haven't been used yet.

3) Have you checked the logs under /var/log? Check dmesg and vmkernel (sorry, don't remember any more what the logs are in 3.5). It might turn out more useful than the emulex debug log.

Hope this helps! Please post an update when you can - I'm curious.

levinton2011101 · ‎11-17-2011

Hey man!

Thanks a lot for the suggestions, intelligent ones, but I already tried all of them.

Regarding the logs, I actually found some clues, so here is some food for your curiosity hehe.

First, repeatedly running cat on the lpfc info shows that the target ports are actually flapping; that is, they disappear and re-appear a couple of minutes later. This behaviour keeps going:

[root@XX root]# cat /proc/scsi/lpfc/1 | grep -A 10 "Current Mapped Nodes"
Current Mapped Nodes on Physical Port:
lpfc0t02 DID 0ab400 WWPN 25:53:00:02:ac:00:0e:36 WWNN 2f:f7:00:02:ac:00:0e:36
lpfc0t01 DID 0ab800 WWPN 50:06:0e:80:05:42:a3:49 WWNN 50:06:0e:80:05:42:a3:49

After a while...

[root@XX root]# cat /proc/scsi/lpfc/1 | grep -A 10 "Current Mapped Nodes"
Current Mapped Nodes on Physical Port:
lpfc0t02 DID 0ab400 WWPN 25:53:00:02:ac:00:0e:36 WWNN 2f:f7:00:02:ac:00:0e:36
lpfc0t00 DID 1ebc00 WWPN 50:06:0e:80:05:42:a3:69 WWNN 50:06:0e:80:05:42:a3:69
lpfc0t01 DID 0ab800 WWPN 50:06:0e:80:05:42:a3:49 WWNN 50:06:0e:80:05:42:a3:49
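Comparing these listings by eye gets tedious when the flapping goes on for hours. The following is a minimal Python sketch that diffs two captured "Current Mapped Nodes" listings by target WWPN. It assumes the output was saved as text and analysed off the host; the regular expression is modelled only on the lines shown above, and the function names are my own.

```python
import re

# Matches one target line from the "Current Mapped Nodes" listing.
# Assumption: the line format is exactly as in the captures above.
NODE_RE = re.compile(
    r"^(?P<name>lpfc\d+t\d+)\s+DID\s+(?P<did>[0-9a-f]+)\s+"
    r"WWPN\s+(?P<wwpn>[0-9a-f:]+)\s+WWNN\s+(?P<wwnn>[0-9a-f:]+)",
    re.IGNORECASE,
)


def mapped_wwpns(snapshot):
    """Return the set of target WWPNs listed in one captured snapshot."""
    wwpns = set()
    for line in snapshot.splitlines():
        match = NODE_RE.match(line.strip())
        if match:
            wwpns.add(match.group("wwpn").lower())
    return wwpns


def diff_snapshots(before, after):
    """Return (appeared, disappeared) WWPN sets between two snapshots."""
    old, new = mapped_wwpns(before), mapped_wwpns(after)
    return new - old, old - new


FIRST = """Current Mapped Nodes on Physical Port:
lpfc0t02 DID 0ab400 WWPN 25:53:00:02:ac:00:0e:36 WWNN 2f:f7:00:02:ac:00:0e:36
lpfc0t01 DID 0ab800 WWPN 50:06:0e:80:05:42:a3:49 WWNN 50:06:0e:80:05:42:a3:49
"""

SECOND = """Current Mapped Nodes on Physical Port:
lpfc0t02 DID 0ab400 WWPN 25:53:00:02:ac:00:0e:36 WWNN 2f:f7:00:02:ac:00:0e:36
lpfc0t00 DID 1ebc00 WWPN 50:06:0e:80:05:42:a3:69 WWNN 50:06:0e:80:05:42:a3:69
lpfc0t01 DID 0ab800 WWPN 50:06:0e:80:05:42:a3:49 WWNN 50:06:0e:80:05:42:a3:49
"""

appeared, disappeared = diff_snapshots(FIRST, SECOND)
print("appeared:", sorted(appeared))
print("disappeared:", sorted(disappeared))
```

With the two captures above, the only difference is the target with WWPN 50:06:0e:80:05:42:a3:69, which is absent from the first listing and present in the second.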

And this entry in /var/log/vmkernel adds an interesting detail. I think the HBA keeps the addresses of the flapping targets but ESX is changing them somehow, and thus nothing matches:

Nov 17 21:57:56 XX vmkernel: 101:08:16:35.838 cpu0:1061)SCSI: 2350: Could not verify that the disk id of path vmhba0:0:11 matches the id of the target
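To see which paths hit this message and how often, a small Python sketch can tally them from a saved copy of the log. The pattern is built only from the single line quoted above, so treat it as an assumption that other occurrences look the same; the function names are illustrative.

```python
import re

# Assumption: every occurrence has the same wording as the sample line.
VERIFY_RE = re.compile(
    r"Could not verify that the disk id of path "
    r"(?P<path>vmhba\d+:\d+:\d+) matches the id of the target"
)


def failed_paths(log_lines):
    """Count 'could not verify disk id' messages per path name."""
    counts = {}
    for line in log_lines:
        match = VERIFY_RE.search(line)
        if match:
            path = match.group("path")
            counts[path] = counts.get(path, 0) + 1
    return counts


def split_path(path):
    """Split an ESX 3.x path name into (adapter, target, LUN)."""
    adapter, target, lun = path.split(":")
    return adapter, int(target), int(lun)


SAMPLE = (
    "Nov 17 21:57:56 XX vmkernel: 101:08:16:35.838 cpu0:1061)SCSI: 2350: "
    "Could not verify that the disk id of path vmhba0:0:11 matches the id "
    "of the target"
)

counts = failed_paths([SAMPLE])
for path, hits in counts.items():
    print(path, split_path(path), hits)
```

Grouping the hits by target number makes it easier to line them up with the lpfc target that was flapping at the same time.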

And after doing some more research (I work in storage, not on the VMware team, but I find this stuff fascinating): the HBAs on this host have different firmware versions (one of them reaaaaally old), and this might seriously be the cause. I already opened a case with Emulex:

[root@XX hbanyware]# cat /proc/scsi/lpfc/1 | grep Firmware
Firmware Version: 2.50A4 (W2F2.50A4)

[root@XX hbanyware]# cat /proc/scsi/lpfc/2 | grep Firmware
Firmware Version: 2.80A4 (W3F2.80A4)
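The same check can be scripted across many hosts. Below is a minimal Python sketch that pulls the version string out of saved lpfc output and flags a mismatch. It assumes the "Firmware Version:" line always looks like the two shown above, and the dictionary of captured text stands in for however the output is actually collected.

```python
import re

# Assumption: the version is the first token after "Firmware Version:".
FW_RE = re.compile(r"Firmware Version:\s*(?P<version>\S+)")


def firmware_version(proc_text):
    """Return the firmware version string, or None if the line is absent."""
    match = FW_RE.search(proc_text)
    return match.group("version") if match else None


def mismatched(versions):
    """True if the HBAs do not all report the same firmware version."""
    return len(set(versions.values())) > 1


# Text captured from the two HBAs on this host (from the output above).
captured = {
    "/proc/scsi/lpfc/1": "Firmware Version: 2.50A4 (W2F2.50A4)",
    "/proc/scsi/lpfc/2": "Firmware Version: 2.80A4 (W3F2.80A4)",
}

versions = {hba: firmware_version(text) for hba, text in captured.items()}
print(versions)
print("mismatch:", mismatched(versions))
```

This only detects that the versions differ; it does not try to decide which one is older, since that would need knowledge of the vendor's version scheme.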

Will let you know how this turns out.

Thanks once again for your help!!

Best,

Jon

levinton2011101 · ‎11-18-2011

OK, some updates from the Emulex Tech Support Engineer:

Emulex Fibre Channel adapters generate no SCSI commands, and neither do the drivers. They are transport only. LUN resets will be logged by the Emulex driver as an indication that the SCSI layer of the OS (ESX 3.5 in this case) requested a reset and the driver has completed the request. I/O aborts mean the layers above the Emulex driver have given up on commands and aborted the I/O. The driver is usually just reporting the issue. These types of errors are SCSI level, meaning the SCSI layer of the OS is timing out SCSI commands sent to the storage. If there were Fibre Channel level issues (the transport layer), there would be other messages from the driver indicating issues. If you only see the I/O aborts and LUN resets, then the HBA and driver are not at fault. If paths disappear in the driver, then that usually means a Fibre Channel layer issue is occurring, causing the driver to drop the node (the target) due to communications issues. You may need to get a Fibre Channel analyzer on site to take a trace to find the issue.

And

I noticed some of the HBAs are using very old firmware, v2.50A4, which is no longer supported by Emulex. It does have a few known issues with connections, and since there are firmware dumps from the HBA, the firmware might be crashing on the card. The driver would then reset the card and it would take off from there. That might cause paths to disappear.

Also, the HBAs you have are not Emulex products. The LPe1150-E are EMC products and support comes through your local EMC support channels or resellers.

I recommend you update the firmware on all the LPe1150-E to the latest supported by EMC:

http://www.emulex.com/downloads/emc/lpe1150-e/firmware-and-boot-code.html

So, we are in the process of updating the firmware to see if it helps.

Best,

jon