VMware Cloud Community
jprudente
Contributor

New host disconnecting from storage - Need help troubleshooting

Hi All,

We have a five-node vSphere 5.1 cluster; four of the hosts have been working fine for years, and the fifth was recently added and is having performance problems. These problems manifest as extremely slow VM performance, especially when attempting to log on to or reboot a (Windows) guest. As an example, one Windows server we have reboots in 4 minutes on any of our other hosts, yet takes 25 minutes when on host 5.

I opened a case with VMware support, and they identified the cause as the host disconnecting from its storage. I saw the errors in the logs myself (All Paths Down), so I don't question their diagnosis, but they were not able to tell me anything beyond that and told me to contact the storage vendor.

All five hosts are connected via FC through a single Brocade switch to a single Falconstor SAN; there's no multipathing involved. All hardware is on each vendor's HCL. VMware confirmed the HBA drivers are correct and did not find any issues with the server configuration. All five hosts are on the same patch level, 5.1.0 build 1743533. Basically, everything I can think to look at on the VMware side seems fine, and there are no errors on the Brocade switch.
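In case it helps, this is roughly what I've been running on the host to watch path state and catch the APD events (standard ESXi 5.x commands; the grep string is just based on what I saw in my own logs, so adjust as needed):

          # esxcli storage core adapter list        (HBAs and which driver each one loaded)
          # esxcli storage core path list           (state of every path to every device)
          # grep -i apd /var/log/vmkernel.log       (the All Paths Down messages)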

Anyway, I contacted Falconstor; they checked their side of things and are telling me everything looks good. Specifically, I'm being told the following (quoted verbatim): "If the client disconnects a path with us on the upstream, [the SAN] will not log the activity. We will log LUN resets and command aborts coming from the client that is affecting our targets but we do not see any of that."

I'm at a total loss here and not sure where to go next. I have Production Support with VMware and the equivalent with Falconstor, yet I'm in the classic "it's the other vendor" situation and am stuck.

Any advice as to my next step, how to troubleshoot further, etc., would be greatly appreciated.


Thanks,

James

36 Replies
jprudente
Contributor

Spoke to the storage vendor again and they're not seeing the latency, which is frustrating. The SAN is handling commands as fast as it receives them. They did offer to do a three-way call, so hopefully VMware will be agreeable.
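For what it's worth, while we wait on the call I've been watching the latency from the host side with esxtop (disk adapter view), since the counters there at least show roughly where the time is being spent:

          # esxtop        (press 'd' for the disk adapter view, 'u' for devices)
            DAVG/cmd - latency attributed to the device side (HBA, fabric, array)
            KAVG/cmd - latency attributed to the VMkernel
            GAVG/cmd - DAVG + KAVG, roughly what the guest actually sees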

Thanks again for your help. I'll definitely update the thread once I get a resolution.

JPM300
Commander

NP, hopefully you get to the root cause. I look forward to hearing what resolution you guys find.


Yeah, VMware has no problem working with other vendors; we used to do three-way calls with EMC / HP / NetApp / Cisco all the time.

I've had the same issue you're having, where VMware looks at the host and says everything is working as it should and that once the SCSI command leaves the VMkernel it's out of their hands, then you call the storage vendor and they say the same thing. It's usually a breakdown somewhere in between, and when both vendors tackled it together we usually ended up finding the problem.

jprudente
Contributor

I've been working with both Dell and VMware and, to be honest, it's been VERY slow going, so I thought I'd post an update here in hopes the community has some feedback.

As a quick recap, my problem was initially approached as an all paths down / disconnection issue, but in reality has turned out to be a latency issue to the array. Eventually when the latency gets too high the connection to storage is lost, hence the APD.

VMware initially sent me to the storage vendor (Falconstor), who was able to clearly show that they were not the problem. VMware then moved my issue to their storage team; they identified an issue with the firmware on the HBA (QLogic 2462) crashing and reloading. Per VMware, QLogic indicates this needs to be addressed by the hardware vendor. So it was off to Dell, who advised me to change the QLogic HBA from INTx mode to MSI mode. That is the only change that has been made to the server since this all started.
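For anyone else who hits this: the change itself was just a qla2xxx module option set with esxcfg-module plus a host reboot. Dell gave us the exact string, but it was along these lines:

          # esxcfg-module -s "ql2xenablemsi=1" qla2xxx    (request MSI instead of INTx)
          # esxcfg-module -g qla2xxx                      (show what's currently set)

(Note that -s replaces the module's whole option string, so include any other qla2xxx options you already rely on.)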

Things are certainly working better. My hesitance in saying they're fixed is twofold:

  • I have no info on the impact of the different interrupt modes, and the only reference I can find to changing modes involves a very rare problem with a specific combination of hardware and ESX versions, none of which applies to my scenario. Additionally, none of my other hosts even show which interrupt mode they are in, as the info is simply missing from the expected location in the config.
  • During a reboot of the server, the following messages appear in vmkernel.log (how I'm pulling these is shown right after this list):

          MSI: Enabled (which is expected)

          MSI: Falling back to INTx mode -- 0 (which would mean to me the HBA is going back to the mode that theoretically is the problem)
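
          (If anyone wants to check their own hosts, I'm just pulling those lines straight out of the log after a reboot, e.g.:

          # grep -i msi /var/log/vmkernel.log

          and looking at what the qla2xxx driver reports when it loads.)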

I am pushing both Dell and VMware for more info but having a hard time with both companies. Any feedback anyone has here is appreciated.


Thanks,
James

JPM300
Commander

It definitely sounds like it is HBA related then, if switching modes has somewhat resolved your problem. I know that with a lot of QLogic and Emulex cards the firmware and driver levels have to match, or at least be within a certain range, for them to work properly. Could it maybe be a firmware / driver problem?
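If you want to double-check what you're actually running on the driver side, something like this usually does it (firmware itself normally shows up in the HBA BIOS or the QLogic/Dell tools, so treat that part as vendor-specific):

          # esxcli system module get -m qla2xxx           (loaded driver module details)
          # esxcli software vib list | grep -i qla        (installed driver VIB and version)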

You could try swapping out the QLogic card for a card of a different model and see if the problem goes away. Or, if you have a test host, set it up with iSCSI to the same datastores as the rest of your cluster and watch the DAVG / performance to see whether you have any issues there; if you don't, you can probably safely say it's an HBA issue on some level.

I would also keep pushing the vendors for support until you are happy or feel the product has stabilized; let's face it, that's the reason you're paying for the support contracts. :)

Thanks for the update, and I hope it all gets stabilized soon.

jprudente
Contributor

Looks like we've got this solved, though honestly it will take a bit of time before I have complete confidence in this server.

I had to push Dell rather hard but eventually got in touch directly with one of their high-level VMware guys. He confirmed my suspicion above, namely that if the logs showed the card falling back to INTx mode, it was, in fact, in INTx mode and thus nothing had changed.

As it turns out, they have seen a rare issue where there is an incompatibility between ESXi 5.1, QLogic 24xx series HBAs, and Ivy Bridge (E5 v2) CPUs. My problem system met all of those criteria, and what's particularly interesting is that I have another (working) system that is virtually identical except it has a v1 CPU. That would appear to give some weight to their findings.

Ironically I had earlier switched out the HBA for another, known-good HBA, but that was also a 24xx series and thus exhibited the same problem.

Dell graciously sent me a 26xx series HBA to replace the 24xx, and so far, so good. The new HBA is running in MSI-X mode, which is apparently the actual fix, but the 24xx doesn't support MSI-X.

JPM300, I really appreciate your help with this along the way. To be honest, I wish VMware support had been as on top of things.

James

warring
Enthusiast

You can also run esxcli storage san fc stats get and check for errors on the TX counters etc.; that will normally tell you if there are HBA issues. Errors like those can also point at the hypervisor itself; I've seen hosts booting from USB drop out, although that would normally show an error in vSphere for the host.
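For example:

          # esxcli storage san fc stats get

then keep an eye on the error / dropped-frame counters per adapter over time; if one vmhba's counters keep climbing it usually points at that HBA, its SFP, or the cable (exact counter names vary a bit by esxcli build, so go by what your host shows).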

VCP510-DCV
JPM300
Commander

Nice, thanks for the update! Glad to hear you got it squared away. Those one-off issues are always the worst and take a lot of back and forth before you finally get them sorted out.

Glad to hear Dell stepped up, replaced the HBA, and resolved the issue for you. In the past I have always had good support from them; heck, one time a Dell EqualLogic failed on me due to a firmware issue, and I'm pretty sure I could have gotten them to wash my car for me. :)

chicagovm
Enthusiast

I'd like to respond with our configuration first, and then how we are experiencing the same issues:


We have 12-core DL380p Gen8 hosts, each with 2 single-port HP AK344A (HP-branded QLogic) cards in the top slots, 1 / 4.

We then have 2 quad-port NICs in slots 2 / 5.

We have noticed the same type of errors logged, shown below:

2014-0x-16T18:31:19.484Z cpu2:8740)<3> rport-8:0-0: blocked FC remote port time out: saving binding

Path redundancy to storage device naa.60a98000572d4d35673454474d67594e degraded. Path vmhba3:C0:T1:L9 is down

Only the 2nd HBA on all of the hosts is having issues [vmhba3]. We have tried upgrading FW / BIOS / drivers but no luck yet.

We next relocated the HBA that was dropping SAN packets / paths to another slot. So far no issues, but I am not confident that is the fix.


The differences in the two HBAs are the following:

HBA which does NOT drop SAN paths:

MSI-X enabled
Request Queue = 0x88011000, Response Queue = 0x88052000
Request Queue count = 2048, Response Queue count = 512
Number of response queues for multi-queue operation: 2
CPU Affinity mode enabled

HBA which DOES drop SAN paths:

INTx enabled
Request Queue = 0x880f0000, Response Queue = 0x88131000
Request Queue count = 2048, Response Queue count = 512
Number of response queues for multi-queue operation: 0
Total number of interrupts = 416686
Device queue depth = 0x40

So, INTx is the difference; on all 4 hosts with the issue it shows as enabled.

When does this get enabled, and why does the other HBA end up using MSI-X?

We are working with HP / VMware / EMC, but no luck so far.
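
(For reference, the MSI-X / INTx details above come from the qla2xxx info the host exposes; on our builds that's under /proc/scsi/qla2xxx/, e.g.:

          # ls /proc/scsi/qla2xxx/
          # cat /proc/scsi/qla2xxx/6        (instance numbers will differ per host)

If your build doesn't have those nodes, the same MSI / MSI-X negotiation shows up in vmkernel.log when the driver loads.)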

WessexFan
Hot Shot

Just a quick note: I've seen this before (a long time ago) and the issue was a bad SFP or an HBA port gone bad, with everything trying to load-balance around it and ending up tanking performance. vmhba1:C2:T0:L0 is having errors, so can you turn that port off on the fibre switch?

VCP5-DCV, CCNA Data Center
jprudente
Contributor

chicagovm, presumably you've read my posts and seen that in our case we ran into a very specific set of problematic circumstances. I don't think it's out of the realm of possibility that the HBA works properly in one slot and not the other, although I completely understand the lack of confidence in that as a fix.

I think you're on the right track in wanting to know why one card shows MSI-X and the other shows INTx. Have you enabled debug logging on the HBAs and confirmed that one is actually in MSI-X mode? During the troubleshooting process, we explicitly enabled MSI mode (we knew the card didn't support MSI-X) and even though the HBA properties showed MSI mode, the debug logging showed it was attempting MSI mode and falling back to INTx. I wonder if that's part of what's happening here.
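
(In our case, turning on the extended qla2xxx logging was itself just another module option plus a reboot, something along the lines of:

          # esxcfg-module -s "ql2xextended_error_logging=1" qla2xxx

but I'd confirm the exact option name with your support contact since it seems to vary by driver version, and remember that -s replaces the module's whole option string.)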

Have you swapped HBAs and confirmed whether the problem occurs with whichever card is in the problem slot?

James

chicagovm
Enthusiast

  • During a reboot of the server, the following messages appear in vmkernel.log:

          MSI: Enabled (which is expected)

          MSI: Falling back to INTx mode -- 0 (which would mean to me the HBA is going back to the mode that theoretically is the problem)

Just checked my logs and I have that same entry.

2014-07-22T12:50:13.972Z cpu4:8883)<4>qla2xxx 0000:21:00.0: MSI: Falling back to INTx mode -- -22.

2014-07-23T01:13:58.961Z cpu24:8883)<4>qla2xxx 0000:27:00.0: MSI: Falling back to INTx mode -- -22.

jprudente
Contributor

Well, that's one question answered at least. Have you confirmed whether the HBAs support MSI-X mode? In our case, Dell said MSI-X (not MSI) was the fix, but we needed to upgrade to an HBA that supported it.

Out of curiosity, what type of CPU is in these servers?


James

chicagovm
Enthusiast

Have you swapped HBAs and confirmed whether the problem occurs with whichever card is in the problem slot?

We have not, but...thank you for the idea. It's difficult to recreate the issue as it happens very randomly with high I/O and/or when VADP backups take place.


We did, however, already swap slots on 2 hosts, and the issue has not YET returned.

chicagovm
Enthusiast

How would you figure out if the HBA (HP AK344A, ISP2532 chip) supports MSI-X mode?

The CPU, which is the same in all hosts with the issue, is:

Intel Xeon E5-2695 v2 - 2.4 GHz - 12 cores / 24 threads - 30 MB cache - LGA2011

Is that the same as yours?

jprudente
Contributor

Our problem host was an E5-2650v2, so yeah, same architecture.

With the HBA, I'd suggest figuring out which QLogic model it corresponds to and finding a spec sheet. HP may have that info, but we're a Dell shop so I'm not too familiar with their documentation.


James

chicagovm
Enthusiast

I just found that the corresponding QLogic card for the AK344A is the QLE2560. Still searching for a spec sheet, but we may also try to enable MSI-X on the HBA with issues, which currently shows INTx as enabled.

chicagovm
Enthusiast

Yet we have 2 HBAs per host; one is showing MSI-X enabled and the other INTx enabled. How do I enable MSI-X on the one showing INTx? Or does the command below enable it for any HBA where it is not already enabled?


Enable MSIx interrupts on the host with:

# esxcfg-module -s "ql2xenablemsi=1" qla2xxx
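
(My understanding, and someone please correct me if I'm wrong, is that esxcfg-module options apply to the driver as a whole, i.e. every HBA that qla2xxx claims, not to one specific card, and need a host reboot to take effect. Planning to verify afterwards with:

          # esxcfg-module -g qla2xxx
          # grep -i msi /var/log/vmkernel.log

to see what each vmhba actually negotiated.)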
