VMware Cloud Community
Budrumi
Contributor

lost access to volumes on SAN

Hi.

we have a problem where our ESXi servers keep losing access to individual volumes on a SAN.

The exact message is "Lost access to volume 4bf11fe8-2cf95c45-759a-0024e83a5515 (LUN 3) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly."

A few seconds later, access is restored with the message "Successfully restored access to volume 4bf11fe8-2cf95c45-759a-0024e83a5515 (LUN 3) following connectivity issues."

Several days ago, there were also "Path redundancy to storage device naa.60022190008a4e3c000058274bf0a9ff degraded. Path vmhba35:C3:T0:L3 is down. 3 remaining active paths. Affected datastores: "LUN 3"" errors, although this happened on only one of the servers.

This happens only for some LUNs, sometimes more than one at once, and with a different set of LUNs each time, i.e. it seems rather random. It happens several times throughout the day, at different times and on different servers.

Due to path failover, there's only a very short perceivable moment of VM slowdown, a second or two (I guess before the failover path kicks in). We're using Round Robin for all paths.

I've read the KB article on general iSCSI connectivity troubleshooting, but I can't find anything. Since this is ESXi, I can only vmkping one storage IP at a time from the unsupported service console, and I'm not noticing any issues, but that might be because a path other than the one I'm pinging is the one currently failing.
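For reference, this is roughly what I've been running from the console (the target IPs below are placeholders for our actual iSCSI portal addresses):

# ping each iSCSI target portal in turn; vmkping goes out through the vmkernel stack
vmkping 10.10.10.1
vmkping 10.10.10.2
# list all paths and their current state to spot any that are down
esxcfg-mpath -l
# show per-device multipathing details (should report VMW_PSP_RR as the path selection policy)
esxcli nmp device list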

The connectivity loss lasts anywhere from 5 to 50 seconds and happens on 3 of our 4 ESXi servers. These 3 servers have build 244038; the 4th server, where this issue doesn't occur, has build 181792.

These issues are also confirmed by Veeam Monitor, which reports Command aborts > 0, although not every time volume access is lost.

Any assistance is appreciated.

Budrumi
Contributor

Hi Jith,

I've seen and read that KB article, but it doesn't help much, as the connectivity is restored relatively quickly and the ESXi servers can see all the LUNs fine.

As for the 2nd link, that's for build 219382 and we have a newer build, so I guess this shouldn't occur.

And as for APD - wouldn't that be reported in the ESXi events? Or do you mean that these connectivity-loss messages might indicate an APD state? There are no hardware issues on the SAN or the servers as far as I can tell.

sathyajith
Enthusiast

Hi Budrumi,

The link below describes the APD issue, with screenshots of how things look when it happens and where to look in the logs.

http://ict-freak.nl/2010/02/25/vsphere-apd-bug-is-solved-in-patch-esx400-200912401-bg/

I'm also assuming that you have checked the HBA and storage against the VMware HCL.

Logs: http://www.vmware.com/support/vc13/doc/c1viewlogs20.html (the VMkernel log is the relevant one for you).
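As a quick sketch of what to look for (on ESXi the vmkernel messages usually end up in /var/log/messages, while classic ESX keeps a separate /var/log/vmkernel):

# pull the lost-access, path-down and abort events out of the log
grep -iE "lost access|path .* down|abort" /var/log/messages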

Hope this helps

Regards

Jith

Budrumi
Contributor

Hi Jith,

thanks for the hints.

There might have been "incorrectly" removed LUNs (some may have been deleted at the SAN level instead of following the "proper" procedure), and since we have a later build than the one that contains the patch, can I assume that a host reboot should fix this?

SAN and servers are from Dell and they're on HCL.

sathyajith
Enthusiast

Hi Budrumi,

Yes, if you have not deleted the LUNs as stated by VMware, a normal reboot should fix it most of the time, provided the machine is not part of a cluster. If this does not fix it, you can manually remove them using the process mentioned below.

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=101508...
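As a rough sketch of what that process amounts to (the exact steps for your build are in the KB article; the esxcli syntax below is from newer ESXi releases, and the datastore name and device ID are only examples):

# unmount the datastore from the host before unpresenting the LUN
esxcli storage filesystem unmount -l ExampleDatastore
# detach the device so the host stops trying to talk to it
esxcli storage core device set -d naa.60022190008a4e3c000058274bf0a9ff --state=off
# rescan so the host cleans up the dead paths
esxcli storage core adapter rescan --all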

It's also recommended to have all the ESX hosts in a cluster on the same build number.

Regards

Jith

Budrumi
Contributor

Whoa, that's a LOT of manual instructions for properly removing a LUN (such a basic operation, IMHO).

Thanks for the info. I admit I didn't know about these procedures; I always assumed that deleting the LUN or removing access to it at the SAN level was good enough, but apparently not.

Since these are production servers, we can't reboot them anytime soon (unfortunately no vMotion here), so thank you for now :)

Budrumi
Contributor

Unfortunately, rebooting the ESXi server didn't help; the connectivity problem messages started appearing again roughly an hour after the reboot...

Any other ideas?

stefanjansson
Contributor

Hi,

this could be the same problem that we are having in our environment. I haven't got a solution yet (it's been over two months since I logged a case), but we have upgraded the SAN switch firmware, the XP firmware and other things as well, and nothing has helped so far. I still believe that something has changed in the ESXi build that is causing this at some sites, since we have exactly the same hardware and configuration in another ESXi cluster and are not seeing these problems there. The only thing that differs is the ESXi build. That fits well with your ESXi host on a lower build not seeing these problems...

You can see our discussion about this problem at :

http://communities.vmware.com/message/1521105#1521105

Regards, Stefan
