I currently have a support case open with VMware regarding this issue, but wanted to post in the forums as well in case someone has run into it. I am experiencing an issue where my storage volumes lose connectivity for several seconds and then recover. This is impacting VM response and performance in my environment. It also hampers my ability to vMotion guests around to perform maintenance during troubleshooting, since the storage volume a guest sits on might become disconnected mid-migration. Very frustrating!! I am current on all driver versions for the hardware involved and even tried forcing my HBAs down from 8 Gb to 4 Gb, but the problem remains. This is a brand-new setup in a new datacenter that we are currently migrating to. In the previous datacenter we were using DS8300 and XIV storage without any issues. Does anyone have any suggestions?
Environment:
ESXi 4.1 U1 hosts
IBM HX5 blade
IBM X3850 X5
IBM DS8700 FC storage
Brocade FC switch
QLogic HBAs
The only suggestion is to sift through the logs. There will be indications of the issue (or issues) in the VMware logs, the FC switch logs, or the SAN logs.
I've seen this on mine as well, where the datastores very quickly disconnect and then reconnect. You can see all of those errors in the event logs. What I did was go through the event logs and write down ALL of the times it was happening, to see whether it was the same time(s) during the day or totally random. There was definitely a pattern to mine, and I narrowed it down to some settings on the storage where the top controller was configured differently than the bottom controller (NetApp). I also found that by disabling some of the aggregate/volume snapshots and snapshot deletions the errors went away, or at least I saw far fewer disconnect/reconnect errors. No idea why these were causing trouble, as they really shouldn't. So the first thing I would recommend is to see whether there is any pattern to when it happens, and then check whether any automated processes (backups, snapshots, etc.) are running at those times.

BTW: it's always good to keep VMware up to date, along with the firmware on the servers. I know you said drivers, but firmware too, for everything including the HBA cards, so you don't have to "re-invent the wheel" if the issue was already fixed in a newer version. I should also say that I was not running ALUA, and I switched everything to ALUA with Round Robin for my multipathing and storage setup. That depends somewhat on your SAN, but if it's an Active/Active setup it helps make sure traffic goes down the primary paths as it should.
Hopefully some of this helps. Also agreed: check the vmkernel logs for errors and see if you can find a match there as well. These cases are generally very frustrating, as IBM and VMware like to point at each other, as well as at the FC switch vendor. I think that's another reason why I'm starting to like the UCS idea, where lots of people are all using the same hardware and verified configurations.
Thanks
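As a rough sketch of the pattern-hunting approach above (the log path and message text vary by ESXi version, and the sample log lines here are made up for illustration):

```shell
# Hypothetical sketch: bucket storage-disconnect messages from a vmkernel log
# by hour of day, to see whether they cluster around scheduled jobs.
# The sample lines below are invented; real wording differs by ESXi version.
cat > /tmp/vmkernel.sample <<'EOF'
Mar 10 02:15:01 vmkernel: WARNING: Lost connectivity to storage device naa.6005
Mar 10 02:15:09 vmkernel: Successfully restored access to device naa.6005
Mar 10 02:45:33 vmkernel: WARNING: Lost connectivity to storage device naa.6005
Mar 11 02:15:07 vmkernel: WARNING: Lost connectivity to storage device naa.6005
EOF

# Count "lost connectivity" events per hour: field 3 is the HH:MM:SS timestamp.
grep -i "lost connectivity" /tmp/vmkernel.sample \
  | awk '{print $3}' | cut -d: -f1 | sort | uniq -c
# All three disconnects fall in the 02:00 hour -> a nightly job is suspect.
```

If the counts pile up in one or two hours, compare those windows against backup, snapshot, and replication schedules.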
Nimos, did you ever get a resolution for this? I've been troubleshooting this myself as well. We've got about six hosts across three chassis that are exhibiting this: HS22Vs, QLogic HBAs (QMI 2572), Brocade 8 Gb switches, Fibre Channel out to EMC Symmetrix and CLARiiON storage. For each of the impacted hosts, one of the fibre switch ports shows extremely high error counts; fortunately it's never both. Both ESXi 4.0 U2 and 4.1 U1 are affected.
There are two main things to check; try both:
1.-
VMware KB 1030265 --> written for the HS22V but also applies to the HS22.
ESX 4.1 introduces interrupt-remapping code that is enabled by default. This code is incompatible with some IBM servers. You can work around the issue by manually disabling interrupt remapping.
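For reference, the workaround from that KB can be applied from the ESX/ESXi 4.1 console roughly like this (verify against the KB article for your exact build before running):

```shell
# Disable interrupt remapping on ESX/ESXi 4.1 (per VMware KB 1030265).
# -k sets a VMkernel boot option; a reboot is required for it to take effect.
esxcfg-advcfg -k TRUE iovDisableIR

# Confirm the value that will be used on the next boot.
esxcfg-advcfg -j iovDisableIR

# Put the host in maintenance mode, then reboot for the change to apply.
reboot
```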
2.-
Brocade SAN Switch configuration.
You need to set the correct value for FillWord in your SAN switch.
portcfgshow displays the current FillWord values.
If you connect at 4 Gb, FillWord must be 0.
If you connect at 8 Gb, FillWord must be 1 or, on the latest firmware levels, 3.
portcfgfillword 1 0 --> sets FillWord=0 on port 1.
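A hedged example of checking and changing the fill word on the Brocade switch CLI (exact syntax varies by Fabric OS version, and changing the fill word briefly bounces the port, so do one path at a time):

```shell
# Show current port configuration, including the fill word column.
portcfgshow

# Set fill word mode 3 on port 1 (8 Gb link, newer Fabric OS levels).
# Note: some FOS versions use comma syntax: portcfgfillword <port>, <mode>
portcfgfillword 1 3

# Watch the error counters afterwards to confirm they stop incrementing.
portstatsshow 1
```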
Good luck!
I had the same issue, so I worked through everything by OSI model layer (Application, Presentation, Session, Transport, Network, Data Link, Physical). I would have saved a lot of time if I had worked through the layers in the other direction, as the culprit was a faulty HBA card!
Environment:
ESXi 4.1 U1 hosts
HP BL460c blade
HP EVA storage
Brocade FC switch
QLogic HBAs
I started isolating each host by putting it in maintenance mode for 24-48 hours until the cluster stopped producing the errors, then replaced the HBA on the suspect host.
Hope this helps someone else...
Updating my own post here:
After extensive troubleshooting with IBM, VMware, and EMC, we were unable to get to a simple root cause for this. We found a pretty good workaround in house through trial and error. The main symptom that can be easily seen is an extremely high port error count on the fibre switch ports; millions and even billions of errors were witnessed in our environment. While noticed first in VMware, upon further investigation we found the same issue on Windows and Linux blades that "never seemed quite right" anyway.
In short, our workaround can be summarized as: "Adjust HBA and switch port speed settings until the link works well and port errors are minimal." Our starting point is to hardcode 4 Gbps on both the HBA and the switch port for 8 Gbps switches, and 2 Gbps for 4 Gbps switches. Seems strange, but that's what usually works. Sometimes the ports need to be disabled and re-enabled, or set to Auto speed and then back (to 4 or 2 Gbps), before they'll work properly.
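An illustrative sketch of that workaround on the Brocade side (port numbers are examples; the HBA half is typically set in the QLogic BIOS/Fast!UTIL, not shown here, and CLI syntax can differ by Fabric OS version):

```shell
# Hardcode switch port 4 to 4 Gbps instead of auto-negotiation.
portcfgspeed 4 4

# If the link still misbehaves, bounce the port.
portdisable 4
portenable 4

# Watch the error counters to judge whether the link is now clean.
portstatsshow 4
```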
We've found a couple evil, evil ports. We've replaced HBAs and entire blades, and the blade in that slot just never works right. Moving the blade to a slot right next to it will make the blade work fine (effectively different fiber port, same fiber switch and blade chassis).
Different firmware revisions, HBA configurations, switch port configurations (fillword, etc) have had no conclusive impact on this issue.
Some notes about our environment:
Something is not quite right about these newer Qlogic cards' interaction with the newer Brocade switches.
I have had some problems with lost storage devices, same as you, but my environment consists of Brocade FC modules (Brocade M5424) connected to 8 Gb Brocade switches (Brocade 300). Could you answer some questions about the configuration?
1. What is the impact of configuring FillWord? Does it require a reboot or drop traffic for a short time?
2. Which ports should I configure?
– Switch module: the downlinks to the blade servers, or the uplinks to the Brocade 300?
– Brocade 300: the ports where the switch modules are connected?
Thanks & Best Regards
What type of storage (brand and model)? Do you also have VAAI enabled on your hosts? If it's enabled, disable it and report back.
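On ESX/ESXi 4.1, a sketch of disabling the VAAI primitives from the host console (these are the standard advanced settings; setting them back to 1 re-enables VAAI, and no reboot is required):

```shell
# Disable the three VAAI primitives on ESX/ESXi 4.1.
esxcfg-advcfg -s 0 /DataMover/HardwareAcceleratedMove   # Full Copy (XCOPY)
esxcfg-advcfg -s 0 /DataMover/HardwareAcceleratedInit   # Block Zeroing
esxcfg-advcfg -s 0 /VMFS3/HardwareAcceleratedLocking    # ATS locking

# Verify the current value of a setting.
esxcfg-advcfg -g /DataMover/HardwareAcceleratedMove
```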
My storage is a NetApp FAS6280, and disabling VAAI UNMAP is my default rule when provisioning new hosts.