Kimbie
Contributor

Lost access to volume following 4.1 upgrade

Setup

1 x HP c7000 Blade enclosure

3 x HP BL460c G6 with dual 5540 Xeons, 48GB RAM and QLogic iSCSI HBA cards

3 x HP P4300 LeftHands

4 x Cisco 3020 blade switches in the back of the c7000, 2 x dedicated for iSCSI traffic

vSphere Server running 4.1

The Problem

We have just gone through the process of upgrading our vSphere server from 4.0 to 4.1 so that it can manage a standalone ESXi 4.1 system, so our attention turned to our three blades running ESXi 4.0u1. Using the built-in Update Manager, we downloaded the 4.0 to 4.1 upgrade file and upgraded our first blade, and we did not notice any issues, as the servers placed on it were low-use ones. We then upgraded the second blade and moved our primary mail server onto it.

It was when we did this that we started to get errors where people were losing connection to the Exchange server. After some investigation we checked the event views and were seeing the error:

"Lost Access to volume due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly"

Then, approximately 10 seconds later, we get the message:

"Successfully restored access to volume following connectivity issues.

This error only occurs on the 4.1 blades. We rolled one blade back to 4.0 and the errors were no longer displayed, and no problems were reported with the servers on that blade. As far as we can tell it is not a networking issue, since the iSCSI traffic for all blades flows over the same switches to the LeftHands, and we were losing connection to volumes on both LeftHands.
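If it helps anyone compare notes, it may be worth correlating the vCenter events with the host-side kernel log around the same timestamps; something like the following should surface any path errors (a rough sketch only, the log locations are my assumption and differ between classic ESX and ESXi):

# Classic ESX Service Console; on ESXi 4.x the vmkernel messages go to /var/log/messages
grep -iE "nmp|failed on physical path|connectivity" /var/log/vmkernel | tail -n 50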

We have a call logged with HP on this, but as of yet we cannot determine what is causing the issue, nor how to resolve it.

So any help is greatly appreciated.

Thanks

Dave

86 Replies
Kimbie
Contributor

All these experts and no one has any idea how to resolve the problem? :)

If anyone has a suggestion, or needs more info, please ask.

Thanks

Kimbie

Kimbie
Contributor

Does anyone have any suggestions on what I can try?

I have checked the HCL and I have the required driver version or newer for the various hardware.
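In case anyone wants to compare, the adapters and driver modules can be listed from the console along these lines (just a sketch; qla4xxx is an assumption for the QLogic iSCSI HBAs, so adjust for your hardware):

# List the storage adapters and the driver module each one is bound to
esxcfg-scsidevs -a

# Confirm the QLogic module is loaded (qla4xxx assumed for the iSCSI HBAs)
vmkload_mod -l | grep -i qla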

HP are not being much help at the moment, so any help or suggestions are more than appreciated.

Thanks

Kimbie

GreyhoundHH
Enthusiast

Unfortunately I don't have a solution for this issue, but we're experiencing the same problem.

We've set up two new servers (Fujitsu RX200 S6 with QLE2460 HBAs) with ESX 4.1. They are attached to the same set of LUNs as our other five servers (ESX 3.5). Only the two new systems show these events about loss/restoration of the SAN connections.

I've filed an SR with VMware regarding this issue, but there's no solution yet.

Kimbie
Contributor

Thanks for the reply; it's nice to know someone else is having the issue.

We have it logged with HP as well. They did send me a link to this KB article:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=102936...

This did not fix our issue, and the change does not persist through a reboot, but it might be worth trying to see if it helps you.

In the article the commands are shown with the trailing "0" wrapped onto the next line:

vsish -e set /net/pNics/vmnic1/hwCapabilities/CAP_IP6_CSUM
0

vsish -e set /net/pNics/vmnic1/hwCapabilities/CAP_IP_CSUM
0

They should look like this:

vsish -e set /net/pNics/vmnic1/hwCapabilities/CAP_IP6_CSUM 0
vsish -e set /net/pNics/vmnic1/hwCapabilities/CAP_IP_CSUM 0
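Since the change does not survive a reboot, one option (just a sketch of an idea, not something HP or VMware have suggested to us) would be to re-apply the two commands at boot from /etc/rc.local, assuming vmnic1 is the NIC carrying your iSCSI traffic:

# /etc/rc.local -- re-applied on every boot; vmnic1 is an assumption,
# substitute the vmnic(s) actually carrying iSCSI traffic
vsish -e set /net/pNics/vmnic1/hwCapabilities/CAP_IP6_CSUM 0
vsish -e set /net/pNics/vmnic1/hwCapabilities/CAP_IP_CSUM 0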

Regards

Kimbie

GreyhoundHH
Enthusiast

OK, I forgot to mention that we're having this issue via Fibre Channel, not iSCSI.

kwolton
Contributor

Hi,

Did you ever find a solution to this? I have exactly the same issue and it is really, really annoying!

Please see my VMware post: http://communities.vmware.com/thread/297185?start=0&tstart=0

This is really frustrating, as it was meant to go into production a month ago!

Kind Regards

Kris

Kimbie
Contributor

Some new drivers have been released by VMware which we have yet to test, though we will hold off and wait for an update to 4.1, as we do not want to have to mess about with installing ESXi and then drivers.

Kimbie

GreyhoundHH
Enthusiast

I have no solution for our issue yet and still have an SR open with VMware.

What I've learned from VMware during our investigation is that (in our case) these errors are related to SCSI reservation conflicts. The conflicts are also present on our five ESX 3.5 servers, but via the vSphere Client they are only logged on ESX 4.x servers. Via the console you can observe the same errors on both ESX 3.5 and 4.x...
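If anyone wants to check their own hosts for the same thing, the conflicts show up in the kernel log, so something like this should find them (a rough sketch; the path is for classic ESX, on ESXi 4.x look in /var/log/messages instead):

# Search the current and rotated vmkernel logs for SCSI reservation conflicts
grep -i "reservation conflict" /var/log/vmkernel /var/log/vmkernel.* 2>/dev/null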

We're still looking for the root cause of the conflicts, but we're not really moving forward 😕

Pylortes
Contributor

Do you do any replication between your SANs? If we stop replication the issue stops, but we still see the same problem you are seeing while we are replicating. We also have a support case open on this.

(2) IBM DS4700s

(2) SVC Clusters 

ydjager
Contributor

This is due to a combination of ESX(i) 4.1, QLogic HBAs and LeftHand SAN/iQ versions prior to 8.5 with patch 10092-00. I made a blog post about this problem:

http://yuridejager.wordpress.com/2011/02/07/lost-access-to-volume-error-with-hp-lefthand-san-storage.... I hope this helps.

Kimbie
Contributor

Thanks, that has resolved our problem.

Kimbie

DDunaway
Enthusiast

I am having the same issue with an HP EVA 6400 Fibre Channel SAN using HP BL460c hosts.

We noticed these same error messages after upgrading to ESX 4.1, and the problem persists even after migrating all our hosts to ESXi 4.1. I came across the following in the 4.1 U1 release notes. It is marked as not previously documented, and it states that the workaround is to install cache memory modules for the local storage array controllers. I have not been able to get HP or VMware to confirm that this will actually resolve our issue.

The errors below would be seen on the host from the CLI immediately after losing connectivity to a volume. Pasted from the 4.1 U1 release notes:

Slow performance during virtual machine power-on or disk I/O on ESXi on the HP G6 Platform with P410i or P410 Smart Array Controller *
Some hosts might show slow performance during virtual machine power-on or while generating disk I/O. The major symptom is degraded I/O performance, causing large numbers of error messages similar to the following to be logged to /var/log/messages:
Mar 25 17:39:25 vmkernel: 0:00:08:47.438 cpu1:4097)scsi_cmd_alloc returned NULL
Mar 25 17:39:25 vmkernel: 0:00:08:47.438 cpu1:4097)scsi_cmd_alloc returned NULL
Mar 25 17:39:26 vmkernel: 0:00:08:47.632 cpu1:4097)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x410005060600) to NMP device "naa.600508b1001030304643453441300100" failed on physical path "vmhba0:C0:T0:L1" H:0x1 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
Mar 25 17:39:26 vmkernel: 0:00:08:47.632 cpu1:4097)WARNING: NMP: nmp_DeviceRetryCommand: Device "naa.600508b1001030304643453441300100": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
Mar 25 17:39:26 vmkernel: 0:00:08:47.632 cpu1:4097)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x410005060700) to NMP device "naa.600508b1001030304643453441300100" failed on physical path "vmhba0:C0:T0:L1" H:0x1 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0

Workaround: Install the HP 256MB P-series Cache Upgrade module from http://h30094.www3.hp.com/product.asp?mfg_partno=462968-B21&pagemode=ca&jumpid=in_r3924/kc
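Before ordering parts, you can check whether a host is actually hitting this particular symptom, since the release notes say the errors are logged to /var/log/messages. A quick check along these lines (my own sketch; adjust the path if your build logs elsewhere):

# Look for the symptom described in the release notes, and count how often it occurs
grep "scsi_cmd_alloc returned NULL" /var/log/messages
grep -c "scsi_cmd_alloc returned NULL" /var/log/messages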

We have ordered these cache modules and have begun to install them. I noticed that the HP BL460 G1 blades already have cache memory on their local storage. Our G6s did not come with cache memory at all.

Crossing our fingers, hoping this resolves the issue.

David

Yann59
Contributor

I also have the same problem with an MSA2312i SAN with 7 x 450GB 15k disks in RAID 5 and two HP DL360 G7s using iSCSI.

For the moment, no solution.

MaximZ
Contributor

Same issue across multiple hosts after upgrading to ESXi 4.1.0 build 348481.

Running Dell 1950/R610 connected via iSCSI to MD3000i.

Tried Round Robin, Most Recently Used, and Fixed path policies.

Still have an issue 😞

No real solution from either VMware or Dell 😞

-- Maxim

Josh26
Virtuoso

Hi,

You may have a similar symptom but you really don't have the same problem.

The problem being discussed here is caused by early versions of the LeftHand software. If you had the same issue, I would ask you to update to the current LeftHand software. Since you do not appear to be running LeftHand at all, I would recommend starting a fresh thread describing your exact problem.

Yann59
Contributor

Have you tested with 4.1 Update 1?

AlexLudwig
Contributor

We have had the same problem for a few days, since I updated some of the blades to the newest updates. Going back to the old version did not help, though, so maybe it was just coincidence.

We are using ESXi 4.1 Update 1 in a c7000 enclosure with BL460c G6 and G7 blades, Emulex LPe 1105 and LPe 1205 adapters, and EMC CLARiiON CX4 storage (RAID 5 and MetaLUNs), connected through Flex-10 and an 8/20 SAN connector.

We have no solution yet, but I have support requests open with EMC, VMware and HP. I will keep you posted if I get something new. Very annoying...

CieNum
Contributor

Hello,

Same problem here:

MD3200i iSCSI / R610 ESXi host

ESXi 4.1 build 348481:

Lost access to volume 4d8f4f04-2ae8d47c-def8-f04da2003177 (MD3200-1-1T-XXX) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
info
29/03/2011 09:24:15
esxi38.vsphere.xxx.fr

Successfully restored access to volume 4d8f4f04-2ae8d47c-def8-f04da2003177 (MD3200-1-1T-XXX) following connectivity issues.
info
29/03/2011 09:24:18
esxi38.vsphere.xxx.fr

It is really annoying. I have installed all the updates, even on our SAN, and nothing has changed...

Thanks for your suggestions.

AlexLudwig
Contributor

Following up on HP's suggestion: we first have to update the firmware of the blade enclosures (I expected that to be the first answer...). There is a known issue:

• Resolved loss of FC connectivity when an 8Gb/24-Port VC-FC Module receives a multi-sequence frame from an in band Storage Management application (application was executing on a server within the c7000 enclosure). Resulted in a 8Gb/24-Port FC Module reset and loss of FC connectivity.

• Resolved an issue when the VC-FC 24-Port Module would not recover from an NO-COMM state after an OA failover. The IP address for the FC Module was not being updated correctly, and blocking the proper communication with the Primary VC Module.

• Resolved an issue where VCM reported a NO-COMM state for a VC-FC module, even though the VC-FC module was still responding to a ping command.

Am I right that you have a c7000 as well?

Second thing: we use MRU to access our SAN LUNs, but with FLARE 29 that is no longer recommended, so we have to change to Fixed or Round Robin with ALUA and PowerPath. I will do the firmware update first, and I am still waiting for another answer from EMC.
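For what it's worth, this is roughly how the path selection policy can be changed from the console once the firmware is done (only a sketch; naa.xxxxxxxx is a placeholder for the real LUN ID, and your storage vendor should confirm the right policy before switching):

# Show all devices with their current path selection policy
esxcli nmp device list

# Switch one device to Round Robin (placeholder device ID)
esxcli nmp device setpolicy --device naa.xxxxxxxx --psp VMW_PSP_RR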

One thing from the VMware support case: we had a look at esxtop and saw a latency of 10 ms from ESXi to the SAN under a load of only 5 MB read/write... I don't want to see the latency when we have real traffic on our SAN ;)
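If anyone else needs to capture those latency numbers for their support case, esxtop in batch mode is handy for that (a sketch; the interval and sample count are arbitrary):

# Capture 60 samples at 5-second intervals into a CSV for offline analysis
esxtop -b -d 5 -n 60 > esxtop-capture.csv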

Just found this regarding ALUA: http://communities.vmware.com/message/1615594
