VMware Cloud Community
rwh23
Enthusiast

Lost Access to Volume (continuous message)

I'm getting the following messages for one of the 4 hard drives connected to a VM

Lost access to volume 4bcce772-3bfe7a35-dceb-001b21541d90 (1_5WD_1_) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

info

4/21/2010 1:16:16 PM

and then not even a second later:

Successfully restored access to volume 4bcce772-3bfe7a35-dceb-001b21541d90 (1_5WD_1_) following connectivity issues.

info

4/21/2010 1:16:16 PM

These messages continue on and off throughout the day. Eventually it says access was lost and never restores it. When that happens, the VM completely freezes and I cannot do anything with it (I have to reboot ESXi at that point).

The hard drive in question is one of three hard drives connected to the VM for a storage pool (the fourth VM HDD is on the datastore as the system disk). It's always the same hard drive, and the other two storage drives don't have an issue.

Is the hard drive going bad? HD test came up clean on that drive.

I'm at a complete loss as to why this one hard drive is continuously having these issues. I tried to look at the hostd logs, but I'm having a hard time deciphering them (the time seems to be off compared to what's shown in vSphere).
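One likely reason for the apparent time mismatch: hostd logs timestamps in UTC, while the vSphere Client shows them in the client's local time. A quick way to convert a log timestamp is shown below; this is a sketch that assumes GNU `date` on a workstation (the busybox `date` on ESXi itself may not support `-d`), and US Eastern time is just an illustrative timezone, not known from the thread.

```shell
# Convert a UTC hostd-log timestamp into local client time.
# Assumes GNU date (-d option). If the vSphere Client showed
# 1:16:16 PM in US Eastern (EDT, UTC-4), hostd would log 17:16:16.
TZ=America/New_York date -d "2010-04-21 17:16:16 UTC" "+%Y-%m-%d %H:%M:%S"
# prints: 2010-04-21 13:16:16
```

Once the offset is clear, the hostd entries around each "Lost access" event line up with what the client displayed.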

Did a search, but nothing came up in regards to my specific issue.

Info:

ESXi v4.0.0 Build 244038

AMD Phenom 9850 BE

Gigabyte MB: GA-MA770-UD3

8GB DDR2 800

3 Replies
admin
Immortal

Do you have another spare controller in the machine you could try connecting the drive to? Otherwise I would guess the drive is going bad.

rwh23
Enthusiast

I can try plugging it into another SATA port and see if that helps.

virtualworld199
Contributor

From the logs it looks like there is a problem with the heartbeat (HA). Try the solution below, or remove the LUNs from the heartbeat.

Please mark the reply as Correct or Helpful.

This event indicates that the ESX host's connectivity to the volume (for which this event was generated) degraded due to the inability of the host to renew its heartbeat for a period of approximately 16 seconds (the VMFS lock-breaking lease timeout). After the periodic heartbeat renewal fails, VMFS declares that the heartbeat to the volume has timed out and suspends all I/O activity on the device until connectivity is restored or the device is declared inoperable.

There are two components to this:

  • Heartbeat Interval = 3 Seconds

  • Heartbeat lease wait timeout = 16 Seconds

A host indicates its liveness by periodically (every 3 seconds) performing I/O to its heartbeat on a given volume. Therefore, if no activity is seen in the host's heartbeat slot for a period of time, we can conclude that the host has lost connectivity to the volume. This wait time is a little over 5 heartbeat intervals, or 16 seconds to be precise.

Example

If an ESX host has mounted a volume san-lun-100 from device naa.60060160b4111600826120bae2e3dd11:1 and loses connectivity to the device (due to a cable pull, disk array failure, and so on) for a period exceeding 16 seconds, the following error message appears:

Lost access to volume 496befed-1c79c817-6beb-001ec9b60619 (san-lun-100) due to connectivity issues.  Recovery attempt is in progress and outcome will be reported shortly.
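To see how often the host is flapping between losing and regaining access, the loss/restore messages can simply be counted with `grep`. The sketch below runs against a small inline sample standing in for the live log; on the host itself you would point it at the kernel log (the path, e.g. /var/log/messages on ESXi 4.x, is an assumption to verify on your build).

```shell
# Tally lost vs. restored events for the problem volume. A sample log
# stands in for the real file here; substitute the host's log path.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
vmkernel: Lost access to volume 4bcce772-3bfe7a35-dceb-001b21541d90 (1_5WD_1_) due to connectivity issues.
vmkernel: Successfully restored access to volume 4bcce772-3bfe7a35-dceb-001b21541d90 (1_5WD_1_) following connectivity issues.
vmkernel: Lost access to volume 4bcce772-3bfe7a35-dceb-001b21541d90 (1_5WD_1_) due to connectivity issues.
EOF
lost=$(grep -c "Lost access to volume" "$LOG")
restored=$(grep -c "Successfully restored access" "$LOG")
echo "lost=$lost restored=$restored"
# prints: lost=2 restored=1
rm -f "$LOG"
```

A gap between the two counts (more losses than restores) matches the symptom in this thread: one final loss that never recovers.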

Impact

All I/O and metadata operations to the specific volume from the COS, the user interface (vSphere Client), or virtual machines are internally queued and retried for some duration of time. If connectivity to the volume or storage device is not restored within that duration, those I/O operations fail. This might have an impact on already running virtual machines as well as any new virtual machine power-on operations.

 

Solution

To resolve this issue:

  1. Connect to the vCenter Server using vSphere Client.
  2. Select the Storage View tab to map the HBA (Host Bus Adapter) associated to the affected VMFS volume.

  3. Follow the steps provided in Troubleshooting fibre channel storage connectivity (1003680) to identify and resolve the path inconsistencies to the LUN.

  4. If connections are restored, VMFS automatically recovers the heartbeat on the volume and continues the operation.

To resolve this issue using the service console:

  1. Connect to the ESX host’s service console.
  2. Run the following commands:
    1. Query VMFS datastore properties. 

      Example:

      # vmkfstools -P san-lun-100
      File system label (if any): san-lun-100
      Mode: public
      Capacity 80262201344 (76544 file blocks * 1048576), 36768317440 (35065 blocks) avail
      UUID: 49767b15-1f252bd1-1e57-00215aaf0626
      Partitions spanned (on "lvm"): naa.60060160b4111600826120bae2e3dd11:1
    2. Use esxcfg-mpath along with the naa ID of the LUN (Logical Unit Number) from the output of the above command to identify the state of all the paths to the affected LUN.

      Example:

      # esxcfg-mpath -b -d naa.60060160b4111600826120bae2e3dd11
      naa.60060160b4111600826120bae2e3dd11 : DGC Fibre Channel Disk (naa.60060160b4111600826120bae2e3dd11) vmhba0:C0:T0:L0 LUN:0 state:active fc Adapter:
      WWNN: 20:00:00:00:c9:7d:6c:e0 WWPN: 10:00:00:00:c9:7d:6c:e0  Target: WWNN: 50:06:01:60:b0:22:1f:dd WWPN: 50:06:01:60:30:22:1f:dd vmhba0:C0:T1:L0 LUN:0 state:standby fc Adapter:
      WWNN: 20:00:00:00:c9:7d:6c:e0 WWPN: 10:00:00:00:c9:7d:6c:e0  Target: WWNN: 50:06:01:60:b0:22:1f:dd WWPN: 50:06:01:68:30:22:1f:dd
  3. Follow the steps provided in Troubleshooting fibre channel storage connectivity (1003680) to identify and resolve the path inconsistencies to the LUN.
  4. If connections are restored, VMFS automatically recovers the heartbeat on the volume and continues the operation.
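The two service-console steps above can be chained: pull the naa device ID out of the `vmkfstools -P` output and hand it to `esxcfg-mpath`. A minimal sketch, shown here against the sample output line from the article rather than a live host (where you would pipe the real `vmkfstools -P` output in):

```shell
# Extract the naa.* device ID from the "Partitions spanned" line of
# vmkfstools -P output, then feed it to esxcfg-mpath. The sample line
# below is from the article; on a host, replace it with real output.
OUT='Partitions spanned (on "lvm"): naa.60060160b4111600826120bae2e3dd11:1'
DEV=$(printf '%s\n' "$OUT" | sed -n 's/.*\(naa\.[0-9a-f]*\):.*/\1/p')
echo "$DEV"
# prints: naa.60060160b4111600826120bae2e3dd11
# then, on the host: esxcfg-mpath -b -d "$DEV"
```

This avoids retyping the long NAA identifier by hand when checking path state.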