albibuk
Contributor

ESXi 5 - IBM HS22V - V7000 - High Latency / Multipath Issue?

Hi everyone,

at a customer site we have a fresh new install:

3 HS22V in BladeCenter H

each with a QLogic 8 Gb mezzanine HBA (reported as QMI2582 in the driver details under /proc/scsi/qla2xxx),

ESXi 5 U1 installed on a USB key,

with the base driver (9.01), because driver 9.11 causes the system to freeze (there is a VMware KB on this, so we cannot update),

connected via FC to a Storwize V7000.

Installation went smoothly.

Zoning OK (checked with IBM Support)

LUN presentation OK (for now 2 LUNs: LUN 1 and LUN 2, each 2 TB),

Disks correctly visible from the hosts.

The problem we are facing is high latency when using the datastores.

Example: when we try to install a new Windows 2008 R2 VM (for test purposes),

we see datastore write latency randomly spike to 2000-2500 ms,

and read latency go beyond 500-600 ms.
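
For reference, the device-level latency can also be watched from the ESXi shell, roughly like this (just a sketch; in esxtop press 'u' for the disk device view and watch the DAVG/cmd and KAVG/cmd columns):

esxtop
esxtop -b -d 5 -n 60 > /tmp/esxtop-latency.csv   # batch mode, to capture samples for later analysis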

Checking vmkernel.log, we see this kind of message:

2012-04-05T08:12:55.597Z cpu0:4120)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x28 (0x4124003cdf80, 16269) to dev "naa.600507680280830af000000000000001" on path "vmhba2:C0:T1:L1" Failed: H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL
2012-04-05T08:12:55.597Z cpu0:4120)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.600507680280830af000000000000001" state in doubt; requested fast path state update...
2012-04-05T08:12:55.597Z cpu0:4120)ScsiDeviceIO: 2309: Cmd(0x4124003cdf80) 0x28, CmdSN 0x80000061 from world 16269 to dev "naa.600507680280830af000000000000001" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-04-05T08:13:00.760Z cpu2:4098)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.600507680280830af000000000000001" state in doubt; requested fast path state update...
2012-04-05T08:13:00.760Z cpu2:4098)ScsiDeviceIO: 2309: Cmd(0x412400d698c0) 0x28, CmdSN 0x8000001c from world 16269 to dev "naa.600507680280830af000000000000001" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-04-05T08:13:01.671Z cpu0:4120)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.600507680280830af000000000000001" state in doubt; requested fast path state update...
2012-04-05T08:13:01.671Z cpu0:4120)ScsiDeviceIO: 2309: Cmd(0x4124003cdf80) 0x28, CmdSN 0x80000003 from world 16269 to dev "naa.600507680280830af000000000000001" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-04-05T08:13:02.076Z cpu0:4120)ScsiDeviceIO: 2309: Cmd(0x4124003cdf80) 0x28, CmdSN 0x8000002d from world 16269 to dev "naa.600507680280830af000000000000001" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-04-05T08:13:02.582Z cpu0:4120)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.600507680280830af000000000000001" state in doubt; requested fast path state update...
2012-04-05T08:13:02.582Z cpu0:4120)ScsiDeviceIO: 2309: Cmd(0x4124003cdf80) 0x28, CmdSN 0x80000045 from world 16269 to dev "naa.600507680280830af000000000000001" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-04-05T08:13:09.112Z cpu16:4112)LinNet: map_skb_to_pkt:288: This message has repeated 458752 times: invalid vlan tag: 4095 dropped
2012-04-05T08:13:51.175Z cpu2:4098)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x28 (0x4124003cce80, 16269) to dev "naa.600507680280830af000000000000001" on path "vmhba2:C0:T1:L1" Failed: H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL
2012-04-05T08:13:51.175Z cpu2:4098)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.600507680280830af000000000000001" state in doubt; requested fast path state update...
2012-04-05T08:13:51.175Z cpu2:4098)ScsiDeviceIO: 2309: Cmd(0x4124003cce80) 0x28, CmdSN 0x8000003d from world 16269 to dev "naa.600507680280830af000000000000001" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-04-05T08:13:53.908Z cpu1:4097)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.600507680280830af000000000000001" state in doubt; requested fast path state update...
2012-04-05T08:13:53.908Z cpu1:4097)ScsiDeviceIO: 2309: Cmd(0x4124003cdf80) 0x28, CmdSN 0x80000055 from world 16269 to dev "naa.600507680280830af000000000000001" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-04-05T08:13:54.921Z cpu2:4098)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.600507680280830af000000000000001" state in doubt; requested fast path state update...
2012-04-05T08:13:54.921Z cpu2:4098)ScsiDeviceIO: 2309: Cmd(0x4124003cb280) 0x28, CmdSN 0x80000057 from world 16269 to dev "naa.600507680280830af000000000000001" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-04-05T09:04:39.966Z cpu22:4118)WARNING: ScsiDeviceIO: 1218: Device naa.600507680280830af000000000000002 performance has deteriorated. I/O latency increased from average value of 9989 microseconds to 204862 microseconds.
2012-04-05T09:04:39.978Z cpu22:4118)WARNING: ScsiDeviceIO: 1218: Device naa.600507680280830af000000000000002 performance has deteriorated. I/O latency increased from average value of 11029 microseconds to 229335 microseconds.
2012-04-05T09:05:12.221Z cpu22:4118)ScsiDeviceIO: 1198: Device naa.600507680280830af000000000000002 performance has improved. I/O latency reduced from 229335 microseconds to 59624 microseconds.
2012-04-05T13:04:42.045Z cpu23:4119)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.600507680280830af000000000000002" state in doubt; requested fast path state update...
2012-04-05T13:04:42.045Z cpu23:4119)ScsiDeviceIO: 2309: Cmd(0x4124403ae500) 0x28, CmdSN 0x1dec3 from world 5937 to dev "naa.600507680280830af000000000000002" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-04-05T13:04:42.349Z cpu23:4119)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.600507680280830af000000000000002" state in doubt; requested fast path state update...
2012-04-05T13:04:43.563Z cpu16:4112)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.600507680280830af000000000000002" state in doubt; requested fast path state update...
2012-04-05T13:04:44.475Z cpu16:4112)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.600507680280830af000000000000002" state in doubt; requested fast path state update...
2012-04-05T13:04:45.082Z cpu16:4112)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x28 (0x4124403ae500, 5937) to dev "naa.600507680280830af000000000000002" on path "vmhba2:C0:T1:L2" Failed: H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL
2012-04-05T13:04:45.386Z cpu17:4113)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.600507680280830af000000000000002" state in doubt; requested fast path state update...
2012-04-05T13:04:46.297Z cpu17:4113)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.600507680280830af000000000000002" state in doubt; requested fast path state update...
2012-04-05T13:04:47.512Z cpu17:4113)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.600507680280830af000000000000002" state in doubt; requested fast path state update...
2012-04-05T13:04:47.870Z cpu16:4112)ScsiDeviceIO: 2309: Cmd(0x412440ef69c0) 0x28, CmdSN 0x1dec9 from world 5937 to dev "naa.600507680280830af000000000000002" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-04-05T13:04:48.174Z cpu18:4114)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x28 (0x412440df3d80, 5937) to dev "naa.600507680280830af000000000000002" on path "vmhba1:C0:T0:L2" Failed: H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL
2012-04-05T13:04:48.174Z cpu18:4114)ScsiDeviceIO: 2309: Cmd(0x412440df3d80) 0x28, CmdSN 0x1decc from world 5937 to dev "naa.600507680280830af000000000000002" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-04-05T13:04:48.478Z cpu18:4114)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.600507680280830af000000000000002" state in doubt; requested fast path state update...
2012-04-05T13:04:48.781Z cpu18:4114)ScsiDeviceIO: 2309: Cmd(0x412440ef5cc0) 0x28, CmdSN 0x1ded1 from world 5937 to dev "naa.600507680280830af000000000000002" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-04-05T13:04:49.389Z cpu18:4114)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.600507680280830af000000000000002" state in doubt; requested fast path state update...
2012-04-05T13:04:49.389Z cpu18:4114)ScsiDeviceIO: 2309: Cmd(0x412440e70d00) 0x28, CmdSN 0x1ded4 from world 5937 to dev "naa.600507680280830af000000000000002" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-04-05T13:04:49.693Z cpu18:4114)ScsiDeviceIO: 2309: Cmd(0x412440ef5bc0) 0x28, CmdSN 0x1ded6 from world 5937 to dev "naa.600507680280830af000000000000002" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-04-05T13:04:49.996Z cpu18:4114)ScsiDeviceIO: 2309: Cmd(0x412440ef56c0) 0x28, CmdSN 0x1ded8 from world 5937 to dev "naa.600507680280830af000000000000002" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
Rarely also:
WARNING: ScsiDeviceIO: 1218: Device naa.600507680280830af000000000000001 performance has deteriorated. I/O latency increased from average value of 248 microseconds to 7629 microseconds

So storage performance is far from optimal.

We looked up the status "H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0"

in the VMware KB - it seems to mean host busy -

but busy on what?
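
One way to look at the affected device and its paths from the ESXi 5 shell is roughly this (a sketch; the naa ID is the one from the logs above):

esxcli storage nmp device list -d naa.600507680280830af000000000000001
esxcli storage core path list -d naa.600507680280830af000000000000001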

We also tried installing ESXi 4.1 U2 (with driver 8.41) - same issue.

We also tried changing the qla2xxx queue depth from 32 to 64, as stated in the KB - same issue.
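
For reference, on ESXi 5 that change is made roughly like this (a sketch of the commands from the KB; a reboot is needed for the module parameter to take effect):

esxcli system module parameters set -m qla2xxx -p ql2xmaxqdepth=64
esxcli system module parameters list -m qla2xxx | grep ql2xmaxqdepth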

A colleague also has the same problem, but on a DS3500 storage (also on Fibre Channel).

Maybe the issue is not the storage but the QLogic adapter?

Any suggestion on how to remediate this problem is very much appreciated; this issue is driving me crazy.

Edit:

We also tried reseating the blades; the problem persists.

35 Replies
albibuk
Contributor

Finally found a "solution" - a better definition would be a workaround...

We replaced the CIOv FC HBA adapters with CFFh combined FC/Ethernet adapters, moved the FC switches from the vertical to the horizontal bays, and now everything seems to work correctly.

The idea came from the last configuration that worked correctly, which used CFFh FC HBAs for the storage connection.

Since the change, the problem has disappeared.

We don't know, however, whether it is a VMware or an IBM issue.

Thank you for all the replies.

splatt
Contributor

Maybe a little late, but another hint may be worth thinking about.

I had a very similar issue, with terrible latencies of over 20 seconds (not milliseconds!) and terrible performance.

I had Brocade SAN switches and QLogic 8 Gb HBAs.

The magic was the fillword setting on older SAN switch firmware levels (6.3.x): if I set it to 3 from the default of 1 and cleared my counters on the ports, everything worked beautifully. Apparently FOS 6.4 is OK; there is an IBM article on this.

I used porterrshow to see the bad effects and portcfgshow to see the changed settings.
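
In case it saves someone a search, the sequence on the Brocade switch was roughly this (a sketch; port 0 is just an example and needs repeating for each affected port): portcfgshow shows the current fill word per port, portcfgfillword changes it, portstatsclear resets that port's counters, and porterrshow then shows whether the errors stop climbing.

portcfgshow
portcfgfillword 0 3
portstatsclear 0
porterrshow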

jesse_gardner
Enthusiast

Wholesale replacement of all HBAs, and moving the switches in the chassis?  Ick (and expensive!)

HannaL
Enthusiast

Is this the article you are thinking of?

http://www-304.ibm.com/support/docview.wss?uid=ssg1S1003699

I have heard of the fillword being wrong causing really bad problems with VMs, such as Linux guests going read-only.

Here is an article my coworker wrote on this

http://www-304.ibm.com/support/docview.wss?uid=isg3T1013082

Hope that helps, Hanna --- BSCS, VCP2, VCP VI3, VCP vSphere, VCP 5 https://www.ibm.com/developerworks/mydeveloperworks/blogs/vmware-support-ibm
jesse_gardner
Enthusiast

Just thought I'd chime in with our experiences.  We've been aware of this issue for about a year now.  We've predominantly experienced it with IBM HS22Vs and QLogic 8 Gb CIOv HBAs (44X1947) connecting to Brocade 8 Gb SAN switches (44X1924).  We had no issues with 100 HS21s before then.

We have a "Crit Sit" case with IBM regarding these issues, and have been digging deep with their Brocade engineer partners.  I'll certainly post here if we ever get to the bottom of this, but I have a hunch it's only going to be through attrition of these specific Qlogic HBAs or Brocade switches.

To quote my observations from December (here: http://communities.vmware.com/message/1764151):

After extensive troubleshooting with IBM, VMware, and EMC, we were unable to get to a simple root cause for this.  We found a pretty good workaround in-house through trial and error.  The main symptom that can be easily seen is an extremely high port error count on the fiber switch ports.  Millions and even billions of errors were witnessed in our environment.  While noticed first on VMware, upon further investigation we found the same issue on Windows and Linux blades that "never seemed quite right" anyway.

In short, our workaround can be summarized as: "Adjust HBA and switch port speed settings until it works well and port errors are minimal."  Our starting point is to hardcode 4gbps on both the HBA and switch port for 8gbps switches, and 2gbps for 4gbps switches.  Seems strange, but that's what usually works.  Sometimes the ports need to be disabled and re-enabled, or turned to Auto speed and then back (to 4 or 2 gbps) before they'll work properly.
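
On the switch side that adjustment looks roughly like this (a sketch; port 5 is just an example, the speed value is in Gbps, and the HBA-side speed is typically set separately in the adapter's BIOS utility): portcfgspeed hardcodes the port speed, portdisable/portenable bounce the port, and porterrshow shows whether the error counters stop climbing.

portcfgspeed 5 4
portdisable 5
portenable 5
porterrshow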

We've found a couple evil, evil ports.  We've replaced HBAs and entire blades, and the blade in that slot just never works right.  Moving the blade to a slot right next to it will make the blade work fine (effectively different fiber port, same fiber switch and blade chassis).

Different firmware revisions, HBA configurations, switch port configurations (fillword, etc) have had no conclusive impact on this issue.

Some notes about our environment:

  • We have both 4gbps (32R1820) and 8gbps (44X1926) Brocade switches
  • We have both 4gbps (49Y4237) and 8gbps (44X1948) Qlogic HBAs
  • Our "Core" switches are old and only support 2gbps, so the upstream speed configurations have no impact on true bandwidth to storage.
  • Our Brocade switches are in Gateway mode, which I gather is probably not the most popular configuration.  IANASG (I Am Not A SAN Guy).
  • I just now noticed that some of our blades are in an unsupported configuration (http://www-947.ibm.com/support/entry/portal/docdisplay?lndocid=MIGR-5083850), though many of our problems are from supported configurations as well.

Something is not quite right about these newer Qlogic cards' interaction with the newer Brocade switches.



To sum it up: Until we get to a proper resolution, we're constantly adjusting port speeds on our blades and fiber switch ports.  It's enough to make you want to pull your hair out.

splatt
Contributor

This is the article I'm currently working through with IBM, dated 25/5/12; basically, update the firmware on the HBA, including the EDC.

My HBAs were just below these levels across the board, having last been updated in Feb '12.

https://www-947.ibm.com/support/entry/myportal/docdisplay?lndocid=MIGR-5087103

It needs a login to their site.

I've made these changes with a DS3524 SAN and it goes 50 times faster!

albibuk
Contributor

I know, Jesse - expensive,

but now the problem is gone and the infrastructure works as expected.

We will now discuss with IBM in a technical meeting what to do next.

albibuk
Contributor

Hi splatt,

I had already done all the updates outlined in that article, with no luck for me;

I hope it works out better for you.

Bye,

Albert

smateo
Contributor

Thank you, Xeonel! You saved me!

To solve the problem, we had to activate the following:

  • flowcontrol receive on
  • flowcontrol send on

on all the interfaces involved in FCoE, on all the switches (Nexus 5548P and 4001i).
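
For anyone searching later, on NX-OS the change looks roughly like this (a sketch; ethernet 1/1 is just a placeholder for each FCoE-facing interface, and "show interface flowcontrol" can be used afterwards to verify):

configure terminal
 interface ethernet 1/1
  flowcontrol receive on
  flowcontrol send on
end
show interface ethernet 1/1 flowcontrol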

Thanks for everything, guys.

Xeonel
Enthusiast

Hey smateo,

I'm glad it worked for you. We still see an issue where, on average once a week, one host cannot access the storage anymore. We used to recover it by flapping an interface on the N4K switch (either fabric worked), but after upgrading the firmware on the Emulex adapter to 4.1 we can't recover it that way anymore.

We've got IBM, VMware and Cisco actively involved in the investigation but so far we don't have a lot to go on.

Can you please describe your environment in more detail (HW specs and FW levels)? We're also using flow control, but we've got priority flow control throughout the Cisco environment.

Thanks,

Andrei

smateo
Contributor

Hi Xeonel,

the architecture is standard: I am using two 5548s with several 2248 FEXes as distribution switches, using the vPC feature.

I have the Nexus 4001i and the SAN directly connected to the 5548 via FCoE.

The firewalls we are using are Juniper SSG series; however, for my problem they are not in the picture.

Despite having all the QoS settings right, the problem was resolved by enabling flow control (send and receive) on the FCoE interfaces, as below:

On the 4001i: the interfaces facing the 5548.

On the 5548: the interfaces facing the 4001i and the SAN.

The solution was simple and worked immediately.

hpfplane
Contributor

I have the same problem!!

Can anyone help me?!

Server: IBM x3650 M3

HBA: 42D0501 - QLogic QLE2560 8Gb FC Single-port HBA for System x

Storage: IBM DS3524

jesse_gardner
Enthusiast

Long story very short, we're beginning to gather some evidence pointing to our old 2 Gb McData upstream core switches as being related to the problem.  Fortunately, we are replacing them in the next few months.  The symptom has only appeared on our HS22s and HS22Vs with the 8 Gb Brocade BladeCenter switches, so we were focusing mostly there.  Perhaps having 8 Gb HBAs and blade switches along with old 2 Gb core switches is the problem for us.  We'll know more in a week or two.

GLoblba
Contributor

Has anyone tested the latest QLogic drivers?

934.5.6.0-1vmw

I have the same problem, and I want to know if anyone has resolved this issue with this upgrade.
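
In case it helps compare notes, the installed qla2xxx driver version can be checked from the ESXi shell roughly like this (a sketch):

esxcli software vib list | grep -i qla
esxcli system module get -m qla2xxx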

williambishop
Expert

Having a similar issue with a V7000 attached to Cisco UCS (both at the latest firmware levels). IBM says the fillword is at issue, but there is no option to change anything on the UCS switches. We are getting MILLIONS of transits according to the Storwize, but nothing visible on the 6296s. Anybody got any ideas? At this point I'm looking at two options: top-of-rack 9148s, or letting EMC take the floor space (tempting anyway, as I've had nothing but issues from the V7000s... even though they were blazing fast initially with the UCS, and worked for 6 months before finally dropping 1300 desktops with 20 SECONDS of write latency).

--"Non Temetis Messor."
jesse_gardner
Enthusiast

We've had various high-tier and even "Crit Sit" cases open with IBM for over a year, and we feel we're now getting to the bottom of all the issues we've had with IBM blade SAN connectivity.

Our first problem was a previously unknown incompatibility between the Brocade 8 Gb BladeCenter switches and our quite old 2 Gb McData ED140 core switch.  This problem caused random blades, at random times, to throw tons of fiber port errors (CRCs, encoding errors, etc.).  We pulled our hair out for years but eventually upgraded our core to newer Brocade 8 Gb switches, which eliminated the first problem but exposed a new one: blades in certain slots just performed poorly, and would occasionally spit out CRCs, ENC errors, discards, etc.  As it turns out, near the end of 2012 IBM diagnosed that half of the 14 blade slots in all BladeCenter chassis do not have optimal connection length between the blade and the switch.  Some are too long, some are too short.  Slots 3 & 4 and slots 10-14, as I recall.  Our experience has been that slots 4, 12, 13, and 14 have been the most commonly bad.  IBM has a new midplane available that is supposed to mitigate this issue.  As fun as replacing midplanes in many BladeCenter chassis sounds...

Another good discussion with a group leading to the same conclusion:

https://www.ibm.com/developerworks/community/forums/html/topic?id=77777777-0000-0000-0000-0000148814...