VMware Cloud Community
mnaitvpro
Enthusiast

ESXi Lost access to volume


Hello Gurus,

Of late, I have noticed an event log entry in the vSphere Client twice within the span of less than a day, related to local storage, as follows:

Lost access to volume 4f4c8bc0-4d13eab8-c8fc-5cf3fc09c3fa (vms-1) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

info | 8/18/2013 2:01:22 AM | vms-1

and immediately afterwards:

Successfully restored access to volume 4f4c8bc0-4d13eab8-c8fc-5cf3fc09c3fa (vms-1) following connectivity issues.

info | 8/18/2013 2:01:22 AM | nesxi1.x.x

The "Ask VMware" link in the event details leads to the VMware KB article "Host Connectivity Degraded" and to the VMware KB article "Host Connectivity Restored".

As per those KBs, VMware is referring to a SAN LUN, but in our case it is local storage. Could you kindly shed some light on why local storage would lose its connectivity?

Note: all the local disks are in a RAID-10 array.
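
In case it helps anyone digging into the same thing, here is a minimal way to check further from the ESXi Shell (a sketch; the volume UUID is the one from the event above, adjust to your environment):

# Search the kernel log for messages about the affected volume around the event time:
grep -i "4f4c8bc0-4d13eab8" /var/log/vmkernel.log
grep -iE "lost access|performance has deteriorated" /var/log/vmkernel.log

# Confirm the local device and VMFS volume are still visible to the host:
esxcli storage core device list
esxcli storage filesystem list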

thanks

25 Replies
ssbkang
Enthusiast

Hi,

Which ESXi version are you running?

I had a similar issue with ESXi 5.1 (no update), and after patching it to the latest release, ESXi 5.1 Update 1, the issue was resolved.
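
To check the exact version and build from the ESXi Shell (for reference):

# Both of these report the product version and build number:
vmware -vl
esxcli system version get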

Hope this helps,

Steven.

mnaitvpro
Enthusiast

ESXi Ver 5.0, Build 469512

sebpiller
Contributor

Hello,

I have been experiencing the same issue for a week now, which corresponds to the upgrade of our ESXi hosts to 5.1.0 build 1157734, though I'm not sure whether it is related.

Side effects are:

- Very high disk latency peaks (up to 10 s!)

- Instability

- Loss of storage paths on some ESXi hosts

- Inconsistencies in some virtual hard disks

Restarting the ESXi host solves the problem, but it comes back as soon as there is heavier disk access (e.g., during backups).
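
If it helps, a way to watch those latency peaks live on the host is esxtop (a sketch):

# Interactive: press 'd' for the adapter view or 'u' for the device view and
# watch DAVG/cmd (device latency) and KAVG/cmd (kernel latency), both in ms:
esxtop

# Batch capture for later analysis (example: one sample every 10 s, 360 samples):
esxtop -b -d 10 -n 360 > /tmp/esxtop-capture.csv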

How did you solve the problem?

Thanks a lot for your feedback, and best regards.

xiekai37
Contributor

I'm running into the same issue with a SAN datastore (VNX5500 array). I'm running ESXi 5.0 (build 1311175).

Did you guys ever resolve the issue?

Thanks

itvmmgrs
Contributor

Exact same issue here.
It's killing me. 5.1.0 (build 1612806). All SAN (EMC CX4), QLogic Fibre HBAs, and new Dell R720s.

It’s getting ugly.

Has anyone resolved this issue?

btniko
Contributor

Hi -

Same issue here with 5.0 and VNX 7500.

Has anyone resolved this issue?



JValdez20111014
Contributor

So... any news on this? I've had the same issue for a while and am going to 5.1 U2 this week. Did anyone else have luck resolving it?

a_p_
Leadership

I don't want to rule out anything. However, I had to troubleshoot an issue like this a few months ago, and it turned out that a bad fibre cable was causing it. You may want to check the FC switch ports to see whether any of them show CRC errors, for example.
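
On a Brocade switch, for example, the counters can be read like this (a sketch; command names are from Brocade Fabric OS):

# Per-port error summary; look for non-zero crc_err or enc_out counts:
porterrshow

# Detailed counters for a single port (replace 5 with the suspect port number):
portstatsshow 5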

André

gustavo_gro
Contributor

Hi!

Same problem here; the scenario is also similar.

Did you fix it? Any ideas?

xiekai37
Contributor

My issue was due to a bug between the HP blade chassis Virtual Connect and the Nexus 5000, but based on my month-long troubleshooting, I suggest that anyone suffering from this problem look at everything:

1. Check the HBA firmware/driver; some versions of the Emulex LOM have bugs that exhibit this behavior.

2. If you use Brocade FC switches with HP blades, check the FillWord value in your switch config.

3. If you use HP Virtual Connect FlexFabric with a Nexus 5000 as your FC access switch, there is a bug with 8 Gb FC; upgrade your Virtual Connect firmware or upgrade your Nexus OS.

4. Upgrade your VNX FLARE code to the December 2013 level; there is a dramatic improvement in ATS locking offload in that version of FLARE.

5. Check to see whether your array's front-end ports are getting QFULL messages; if so, consider throttling the queue depth on the HBA. There is an ESXi setting for this (see the sketch after this list).

6. Check for bad fibre cables and SFPs on and between the HBAs, FC switches, and the array.
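
For point 5, a sketch of the relevant ESXi-side settings (the module and parameter names below assume a QLogic HBA with the qlnativefc driver; adjust for your driver, and treat the values as examples):

# Adaptive queue-depth throttling on QFULL/BUSY conditions (see VMware KB 1008113):
esxcli system settings advanced set -o /Disk/QFullSampleSize -i 32
esxcli system settings advanced set -o /Disk/QFullThreshold -i 4

# Static HBA queue depth (reboot required for the module parameter to take effect):
esxcli system module parameters set -m qlnativefc -p ql2xmaxqdepth=32
esxcli system module parameters list -m qlnativefc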

Good luck.

DyOS
Contributor

Has anyone had any luck fixing this problem? I have a WD iSCSI drive with the same problem. I have to constantly reboot the ESXi host, and it is causing all of my servers to go down.
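
For an iSCSI target like that, it may be worth checking whether the sessions themselves are dropping (a sketch):

# List the current iSCSI sessions and the paths behind the datastore:
esxcli iscsi session list
esxcli storage core path list

# The kernel log usually shows iSCSI connection drops and recoveries:
grep -i iscsi /var/log/vmkernel.log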

HawkieMan
Enthusiast

I would check the MTU size for network-attached storage.
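
To verify the MTU end-to-end toward the storage (a sketch; 8972 bytes assumes a 9000-byte MTU, and the address is a placeholder):

# Jumbo-sized ping with the don't-fragment bit set, sent from a vmkernel port:
vmkping -d -s 8972 192.168.1.50

# Confirm the vSwitch and vmkernel interface MTU settings:
esxcli network vswitch standard list
esxcli network ip interface list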

For the others, I would suggest checking the battery on the RAID controllers and, after that, checking all cables.

I have also seen this issue when servers have been installed with the standard image instead of the hardware vendor's customized image.

virtualworld199
Contributor

What version of ESXi are you on? If you are on ESXi 5.1, updating to ESXi 5.5 should solve this issue.

ravikarthi
Contributor

I'm using ESXi 5.5 build 1746974, and I can still see the error.

Not yet resolved. Any ideas?

bradley4681
Expert

I'm having this same issue on a few Cisco R210 and C240 UCS servers; all have local datastores on MegaRAID controllers and run different versions of ESXi:

Cisco C240 - ESXi 5.0 - no issues

Cisco R210 - ESXi 5.0 - disk access issue

Cisco C240 - ESXi 5.1 - disk access issue

Cisco R210 - ESXi 5.0 - disk access issue

Cisco R210 - ESXi 5.0 - disk access issue

Cisco C220 - ESXi 5.5 - no issues

First:

Device naa.600605b005df73201951a1d33bc62893 performance has deteriorated. I/O latency increased from average value of 708 microseconds to 24612 microseconds.

warning | 9/30/2015 4:46:09 AM | 10.2.42.23

Lost access to volume 54383f2f-62e7730b-ec74-4c4e3544bf5e (snap-0a1ec5ee-datastore1) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

info | 9/30/2015 7:39:33 AM | snap-0a1ec5ee-datastore1

Successfully restored access to volume 54383f2f-62e7730b-ec74-4c4e3544bf5e (snap-0a1ec5ee-datastore1) following connectivity issues.

info | 9/30/2015 7:39:46 AM | 10.2.42.23

[screenshot attached: Capture.PNG]
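
A way to correlate those events with the device on the host (a sketch; the naa ID is the one from the warning above):

# Kernel log entries for this specific device:
grep -i "naa.600605b005df73201951a1d33bc62893" /var/log/vmkernel.log

# Device state and path status, to see whether anything flapped:
esxcli storage core device list -d naa.600605b005df73201951a1d33bc62893
esxcli storage core path list -d naa.600605b005df73201951a1d33bc62893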

Cheers! If you found this or other information useful, please consider awarding points for "Correct" or "Helpful".
ravikarthi
Contributor

Are you using Fabric Interconnects (FI) for your rack servers, or the traditional method?

nandt
Contributor

Hello. In my case, I was losing connectivity to the datastore, and after a few seconds it was restored (VMware 6, IBM x3550 M2, 4 SSDs in RAID 5). I had been looking and reading around for more than a month. Then, after a power loss when a UPS ran out of batteries, the system couldn't boot properly (it took more than an hour), so I started to really examine the system, and I found that the little button-cell battery was bad. Actually, the battery itself read an OK 3 volts, but the system showed an error (an LED on the motherboard). After I changed it according to IBM's instructions (power off, etc.), the system volume has worked fine since. It has been one week without any errors.

malabelle
Enthusiast

Hi,

We encountered the same problem and are still troubleshooting. We've been having conference calls with a Dell master engineer, two people from VMware, two from EMC, and one from Brocade.

We have:

Multiple Dell M1000e chassis

     - Dell M630 blades with QLogic QME2662 mezzanine cards

     - Brocade 6505 chassis switches

Multiple Dell R730xd servers

     - QLogic QLE2662

Brocade 5100 core FC switch

The LUNs are on an EMC VNX 7600.

VMware ESXi 5.5 U2 and 6.0 U1

We used both the Dell custom ISO and the VMware vanilla ISO.

On the Dell custom ISOs, the qlnativefc driver is really new (v2.x).

We tried a lot of changes and have been fighting this issue for three weeks now.

What we managed to find that seems to work is the following.

Since we have a lot of older servers that work well, we added four new paths on the VNX for the new chassis and servers.

We downgraded qlnativefc to the following version:

qlnativefc-1.1.20.0

And we changed them all at the same time. When we add one host with the newer driver, the lost datastores seem to start again...
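
For anyone trying the same downgrade, a sketch of checking and swapping the driver VIB (the offline-bundle filename below is an example; use the one that matches your download):

# Check which qlnativefc version is currently installed:
esxcli software vib list | grep -i qlnativefc

# With the host in maintenance mode, remove the newer driver and install the
# older VIB, then reboot:
esxcli software vib remove -n qlnativefc
esxcli software vib install -d /vmfs/volumes/datastore1/qlnativefc-1.1.20.0-offline_bundle.zip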

I will keep writing in this post when I have something new, good or bad.

vExpert '16, VCAP-DCA, VCAP-DCD
megrez80
Contributor

I just started seeing this today as well, with ESXi 6.0 on an HP ProLiant DL380 G6. The volume is on internal hard drives.

It is never able to recover. The recovery process takes up lots of CPU, rendering the VMs unresponsive. The only way to recover is to power cycle the server.

Is this indicative of a failing hard drive or controller?
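
One thing that might help narrow down drive vs. controller (a sketch; SMART data is only available when the controller exposes it, and the device ID below is a placeholder):

# Find the local device ID, then query its SMART health counters:
esxcli storage core device list | grep -i "Display Name"
esxcli storage core device smart get -d naa.600508b1001c4d41

# Also check the kernel log for resets or aborts pointing at the controller:
grep -iE "reset|abort" /var/log/vmkernel.log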

Thanks.
