ESX 4.0 rejects new SAN storage path through Emulex HBA

VirtualSac · 04-07-2010

Hi,

I have an ESX 4.0 cluster with 2 hosts. One host has no issues reaching the 3PAR storage through the Brocade switch, but the second host has the following issues.

1. It has 2 dual-ported HBAs, of which only one port works for storage connectivity.

2. Below is the output of "esxcfg-scsidevs -a":

vmhba0 lpfc820 link-down fc.20000000c967f016:10000000c967f016 (7:1.0) Emulex Corporation LP10000-S 2Gb Fibre Channel Host Adapter

vmhba1 lpfc820 link-down fc.20000000c967f015:10000000c967f015 (7:1.1) Emulex Corporation LP10000-S 2Gb Fibre Channel Host Adapter

vmhba2 lpfc820 link-n/a fc.20000000c957dbfc:10000000c957dbfc (7:2.0) Emulex Corporation LP10000-S 2Gb Fibre Channel Host Adapter

vmhba3 lpfc820 link-up fc.20000000c957dbfd:10000000c957dbfd (7:2.1) Emulex Corporation LP10000-S 2Gb Fibre Channel Host Adapter

vmhba4 mptsas link-n/a sas.50003ba0000003ba (7:4.0) LSI Logic / Symbios Logic LSI1064

vmhba5 pata_amd link-n/a ide.vmhba5 (0:6.0) nVidia Corporation NVidia NForce CK804 IDE/PATA Controller

+ We have a storage controller at 0:0.0+

vmhba32 usb-storage link-n/a usb.vmhba32 (0:0.0) nVidia Corporation CK804 Memory Controller

vmhba33 pata_amd link-n/a ide.vmhba33 (0:6.0) nVidia Corporation NVidia NForce CK804 IDE/PATA Controller

vmhba3 works fine. I tried to establish a second path to the storage through vmhba0 and vmhba1, but as soon as we enable the zone on the Brocade, the status of the HBA changes to link-down.
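To see at a glance which Fibre Channel ports are not up, the listing above can be parsed with a short script. This is only a sketch: the column layout is inferred from the pasted output, the function names are made up for illustration, and reading the `fc.` identifier as node name followed by port name is an assumption based on how the values line up with the Portname/Nodename pairs the driver reports.

```python
import re

# Lines copied from the "esxcfg-scsidevs -a" output in this post.
SAMPLE = """\
vmhba0 lpfc820 link-down fc.20000000c967f016:10000000c967f016 (7:1.0) Emulex Corporation LP10000-S 2Gb Fibre Channel Host Adapter
vmhba1 lpfc820 link-down fc.20000000c967f015:10000000c967f015 (7:1.1) Emulex Corporation LP10000-S 2Gb Fibre Channel Host Adapter
vmhba2 lpfc820 link-n/a fc.20000000c957dbfc:10000000c957dbfc (7:2.0) Emulex Corporation LP10000-S 2Gb Fibre Channel Host Adapter
vmhba3 lpfc820 link-up fc.20000000c957dbfd:10000000c957dbfd (7:2.1) Emulex Corporation LP10000-S 2Gb Fibre Channel Host Adapter
vmhba4 mptsas link-n/a sas.50003ba0000003ba (7:4.0) LSI Logic / Symbios Logic LSI1064
+ We have a storage controller at 0:0.0+
"""

# Assumed column order: name, driver, link state, identifier, (PCI address), description.
LINE_RE = re.compile(r"^(vmhba\d+)\s+(\S+)\s+(link-\S+)\s+(\S+)\s+\((\S+)\)\s+(.*)$")


def parse_adapters(text):
    """Return one dict per adapter line; lines that do not match are skipped."""
    adapters = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        name, driver, link, uid, pci, desc = m.groups()
        # Assumption: "fc.<node name>:<port name>", so the WWPN is the second half.
        wwpn = uid[3:].split(":")[1] if uid.startswith("fc.") else None
        adapters.append({"name": name, "driver": driver, "link": link,
                         "uid": uid, "wwpn": wwpn, "pci": pci, "desc": desc})
    return adapters


def fc_ports_not_up(adapters):
    """Names of Fibre Channel adapters whose link state is anything but link-up."""
    return [a["name"] for a in adapters
            if a["uid"].startswith("fc.") and a["link"] != "link-up"]


if __name__ == "__main__":
    for a in parse_adapters(SAMPLE):
        print(a["name"], a["link"], a["wwpn"])
    print("FC ports not up:", fc_ports_not_up(parse_adapters(SAMPLE)))
```

On a live host the same functions could be fed the captured command output instead of `SAMPLE`.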

3. I get the errors below very frequently in /var/log/vmkernel:

Apr 7 17:10:11 vmhost1 vmkernel: 53:01:00:03.309 cpu15:4276)ScsiNpiv: 1304: GetInfo for adapter vmhba0, , max_vports=0, vports_inuse=0, linktype=0, state=0, failreason=0, rv=-2, sts=bad0020

Apr 7 17:10:11 vmhost1 vmkernel: 53:01:00:03.309 cpu8:5912)ScsiNpiv: 1304: GetInfo for adapter vmhba1, , max_vports=0, vports_inuse=0, linktype=0, state=0, failreason=0, rv=-2, sts=bad0020

Apr 7 17:10:11 vmhost1 vmkernel: 53:01:00:03.309 cpu8:5912)ScsiNpiv: 1304: GetInfo for adapter vmhba3, , max_vports=0, vports_inuse=0, linktype=0, state=1, failreason=0, rv=0, sts=0

4. Some more findings from /proc/scsi/lpfc820, since these are Emulex HBAs:

# more /proc/scsi/lpfc820/*

::::::::::::::

/proc/scsi/lpfc820/5

::::::::::::::

Emulex LightPulse Fibre Channel SCSI driver 8.2.0.30.49vmw

Emulex LightPulse LP10000DC-S 2Gb PCI-X Fibre Channel Adapter on PCI bus 07 device 08 irq 153 port 0

BoardNum: 0

Firmware Version: 1.92A1 (T1D1.92A1)

Portname: 10:00:00:00:c9:67:f0:16 Nodename: 20:00:00:00:c9:67:f0:16

Error: State is -1

::::::::::::::

/proc/scsi/lpfc820/6

::::::::::::::

Emulex LightPulse Fibre Channel SCSI driver 8.2.0.30.49vmw

Emulex LightPulse LP10000DC-S 2Gb PCI-X Fibre Channel Adapter on PCI bus 07 device 09 irq 161 port 1

BoardNum: 1

Firmware Version: 1.92A1 (T1D1.92A1)

Portname: 10:00:00:00:c9:67:f0:15 Nodename: 20:00:00:00:c9:67:f0:15

Error: State is -1

::::::::::::::

/proc/scsi/lpfc820/7

::::::::::::::

Emulex LightPulse Fibre Channel SCSI driver 8.2.0.30.49vmw

Emulex LightPulse LP10000DC-S 2Gb PCI-X Fibre Channel Adapter on PCI bus 07 device 10 irq 161 port 0

BoardNum: 2

Firmware Version: 1.92A1 (T2D1.92A1)

Portname: 10:00:00:00:c9:57:db:fc Nodename: 20:00:00:00:c9:57:db:fc

Link Down

::::::::::::::

/proc/scsi/lpfc820/8

::::::::::::::

Emulex LightPulse Fibre Channel SCSI driver 8.2.0.30.49vmw

Emulex LightPulse LP10000DC-S 2Gb PCI-X Fibre Channel Adapter on PCI bus 07 device 11 irq 169 port 1

BoardNum: 3

Firmware Version: 1.92A1 (T2D1.92A1)

Portname: 10:00:00:00:c9:57:db:fd Nodename: 20:00:00:00:c9:57:db:fd

Link Up - Ready:

PortID 0x11800

Fabric

Current speed 2G

Physical Port Discovered Nodes: Count 2

t00 DID 011e00 State 06 WWPN 22:51:00:02:ac:00:03:6f WWNN 2f:f7:00:02:ac:00:03:6f

t02 DID 011c00 State 06 WWPN 22:44:00:02:ac:00:03:6f WWNN 2f:f7:00:02:ac:00:03:6f

In entries 5 and 6 above (vmhba0 and vmhba1), why does it show "Error: State is -1"?
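The /proc entries are numbered differently from the vmhba names, so it helps to line them up by port name. The sketch below does that on a condensed copy of the dump above. The `WWPN_TO_VMHBA` table is typed in by hand from the esxcfg-scsidevs listing, and the function name is hypothetical; nothing here comes from an Emulex or VMware tool.

```python
import re

# Condensed copy of the "more /proc/scsi/lpfc820/*" output (only the lines used here).
SAMPLE = """\
::::::::::::::
/proc/scsi/lpfc820/5
::::::::::::::
Portname: 10:00:00:00:c9:67:f0:16 Nodename: 20:00:00:00:c9:67:f0:16
Error: State is -1
::::::::::::::
/proc/scsi/lpfc820/6
::::::::::::::
Portname: 10:00:00:00:c9:67:f0:15 Nodename: 20:00:00:00:c9:67:f0:15
Error: State is -1
::::::::::::::
/proc/scsi/lpfc820/7
::::::::::::::
Portname: 10:00:00:00:c9:57:db:fc Nodename: 20:00:00:00:c9:57:db:fc
Link Down
::::::::::::::
/proc/scsi/lpfc820/8
::::::::::::::
Portname: 10:00:00:00:c9:57:db:fd Nodename: 20:00:00:00:c9:57:db:fd
Link Up - Ready:
"""

# Hand-copied from the esxcfg-scsidevs listing: port name (no colons) -> adapter.
WWPN_TO_VMHBA = {
    "10000000c967f016": "vmhba0",
    "10000000c967f015": "vmhba1",
    "10000000c957dbfc": "vmhba2",
    "10000000c957dbfd": "vmhba3",
}


def parse_lpfc_dump(text):
    """Map each /proc path to its port name and the first status line that follows."""
    entries = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("/proc/scsi/lpfc820/"):
            current = line
            entries[current] = {"wwpn": None, "status": None}
        elif current is None:
            continue
        elif line.startswith("Portname:"):
            m = re.match(r"Portname:\s+(\S+)", line)
            entries[current]["wwpn"] = m.group(1).replace(":", "")
        elif line.startswith(("Error:", "Link Down", "Link Up")):
            if entries[current]["status"] is None:
                entries[current]["status"] = line
    return entries


if __name__ == "__main__":
    for path, info in parse_lpfc_dump(SAMPLE).items():
        print(path, WWPN_TO_VMHBA.get(info["wwpn"], "unknown"), info["status"])
```

Matching on the port name avoids guessing how the /proc instance numbers relate to the vmhba numbers.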

Both servers carry significant load and are in production, so it is difficult to schedule downtime, and arranging it might take a while. Any suggestion that can help without downtime would be appreciated.

Thanks,

VirtualSac

golddiggie · 04-07-2010

Have you triple-confirmed there's no issue with the fiber switch and all cables? How about confirming that the switch hasn't lost some functionality (not bad per se, but not working right nonetheless)? Do you have any hardware support agreements covering the components that you can pull on for assistance?

VMware VCP4

Consider awarding points for "helpful" and/or "correct" answers.

VirtualSac · 04-07-2010

Yes. We do have Support.

VirtualSac · 04-07-2010

Yes, I have already cross-checked that. Earlier the path was on vmhba0; because of the issue I tried moving it to vmhba1, and as soon as the path was established the link went down. Is there any way I can change the state of the HBA from link-down, or get the "State is -1" error cleared?

Thanks

golddiggie · 04-07-2010

I would start reaching out to the support people for fiber segments... It could be something really simple has gone wonky in that configuration and is an easy fix (for them to discover with you). Since you're paying for the support, no point in not using it. If they, with you, go through everything and there's absolutely nothing wrong, I'd reach out to the VMware engineers/support people. Depending on the response time of the Fiber people, you might get faster results from the VMware support engineers. I do know that the first time I put in a help request, they were calling me in under 15 minutes (even though I wanted to do it the next day). This was after our normal business hours (EST), around 8PM I believe. Was able to work with them (easily) to resolve the issue. Next time, I made sure to be ready to work on the issue before reaching out to them.


golddiggie · 04-07-2010

I've not used fiber connections with ESX/ESXi 4 yet... It could be a quick communication with VMware support there... If it's an alarm, you should be able to acknowledge the alarm, and then clear it. Not to say it won't come back again with that method.


VirtualSac · 04-07-2010

Thanks golddiggie,

I thought I would check with the community in case someone has come across a similar situation.

I have worked with VMware support on this and sent them logs as well, but support replied that ESX has no capability to reject a path as such. Just last week we also re-zoned the Brocade, and that too following recommendations from 3PAR support on a call.

Now I am getting more confused. I think I will let those support guys fight with each other and watch the show.

golddiggie · 04-07-2010

If you do get them to fight, record it and/or sell tickets...

From what experience I've had with fiber products, they can be a total PiTA when it comes to trying to figure out why it's not working properly... The worst part is, often, tracing back to when the problem first started and if there were any changes made (anywhere near it, or in the entire configuration) around the same time. That could be anywhere from days to a month (or more) before then... It's another reason I stick with iSCSI whenever I can...


binoche · 04-07-2010

Apr 7 17:10:11 vmhost1 vmkernel: 53:01:00:03.309 cpu15:4276)ScsiNpiv: 1304: GetInfo for adapter vmhba0, 0x41000b05b380, max_vports=0, vports_inuse=0, linktype=0, state=0, failreason=0, rv=-2, sts=bad0020

Apr 7 17:10:11 vmhost1 vmkernel: 53:01:00:03.309 cpu8:5912)ScsiNpiv: 1304: GetInfo for adapter vmhba1, 0x41000b05b740, max_vports=0, vports_inuse=0, linktype=0, state=0, failreason=0, rv=-2, sts=bad0020

rv is -2 instead of 0 here. As far as I know, that usually means the HBA configuration is wrong. Can you double-check the configuration of these HBAs?
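To track which adapters keep returning a non-zero rv, the ScsiNpiv lines can be pulled out of the log with a small filter. This is a rough sketch based only on the line format quoted in this thread. The function names are invented for the example, and the script does not interpret what rv or sts mean; it only reports the values.

```python
import re

# Lines in the format quoted in this thread; the pointer field after the
# adapter name may be present or empty, and both forms are accepted.
SAMPLE = """\
Apr 7 17:10:11 vmhost1 vmkernel: 53:01:00:03.309 cpu15:4276)ScsiNpiv: 1304: GetInfo for adapter vmhba0, 0x41000b05b380, max_vports=0, vports_inuse=0, linktype=0, state=0, failreason=0, rv=-2, sts=bad0020
Apr 7 17:10:11 vmhost1 vmkernel: 53:01:00:03.309 cpu8:5912)ScsiNpiv: 1304: GetInfo for adapter vmhba1, 0x41000b05b740, max_vports=0, vports_inuse=0, linktype=0, state=0, failreason=0, rv=-2, sts=bad0020
Apr 7 17:10:11 vmhost1 vmkernel: 53:01:00:03.309 cpu8:5912)ScsiNpiv: 1304: GetInfo for adapter vmhba3, , max_vports=0, vports_inuse=0, linktype=0, state=1, failreason=0, rv=0, sts=0
"""

ADAPTER_RE = re.compile(r"GetInfo for adapter (vmhba\d+),")
FIELD_RE = re.compile(r"(\w+)=(-?\w+)")  # key=value pairs such as rv=-2


def parse_getinfo(line):
    """Return (adapter, fields) for a ScsiNpiv GetInfo line, or None otherwise."""
    m = ADAPTER_RE.search(line)
    if not m:
        return None
    return m.group(1), dict(FIELD_RE.findall(line))


def adapters_with_nonzero_rv(text):
    """Map adapter name to the last non-zero rv value seen for it."""
    result = {}
    for line in text.splitlines():
        parsed = parse_getinfo(line)
        if parsed is None:
            continue
        adapter, fields = parsed
        rv = int(fields.get("rv", "0"))
        if rv != 0:
            result[adapter] = rv
    return result


if __name__ == "__main__":
    print(adapters_with_nonzero_rv(SAMPLE))
```

Pointing it at the contents of /var/log/vmkernel instead of `SAMPLE` would show whether vmhba3 ever reports a non-zero rv as well.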

binoche, VMware VCP, Cisco CCNA
