Lost Access to Volume

edawg · ‎04-05-2010

Hello-

I have three Dell R710 hosts running ESXi 4, build 244038. Every 1 to 4 days, one of the three hosts logs an error stating, "lost access to volume (lists my volume name). Recovery attempt in progress." Within 1-5 seconds it states, "successfully restored access to (lists my volume name)." During this window my datastores do not drop, and customers are not affected.

These three boxes are in a cluster with HA and DRS enabled. Each host has a physical NIC in a VMkernel port group assigned to iSCSI data. On the network side, the NICs assigned to the VMkernel port group are in a separate VLAN dedicated to iSCSI traffic, connecting to an EMC AX4-5i iSCSI SAN on the same VLAN. The SAN has four 1 Gb iSCSI ports. In Navisphere Manager I do not see any errors being reported when this issue occurs. My network team has looked at all the traffic, ports, cables, etc. and states there are no issues.

All three hosts are syncing time correctly with an internal NTP source. I have reviewed the VMware KB article that provides a checklist of things to check for this issue, and unfortunately none of them appear to apply to my situation. Any thoughts would be greatly appreciated.

Regards,

Erik

stefanjansson · ‎04-12-2010

Hello ,

I am also experiencing these messages in one of my clusters. I'm running HP BL490 hosts on ESXi build 236512, but I am using FC disks (XP24000), not iSCSI.

The scenario looks pretty much as above: now and then one of my 8 ESXi hosts reports "lost access to volume ....." and after 1-10 seconds there is a corresponding message that states "Successfully restored access to ....". I can't see that this is affecting the VMs on the datastore (nothing is reported in the event logs on the Windows hosts).

In the same VC I'm running another ESXi cluster on the same hardware (HP BL490), configured exactly the same as the other cluster, but running ESXi build 208167. This cluster is not affected by the issue, even though it has disks in the same XP24000. The firmware on the machines and HBAs is the same on both clusters. I'm beginning to wonder if this is something that has "been introduced" in later builds of ESXi? I had planned to upgrade to a newer build, but it seems that is not going to resolve the issue, since you, Erik, have the latest build on your cluster.

I can see from the timestamps that most of these messages arrive at times like 18.00, 11.30, 5.50, 22.00, 13.15 and so on, which leads me to think that this is happening when the automatic scanning of datastores from the ESXi host runs every 5 minutes.

Are we the only ones seeing these messages?

regards Stefan

edawg · ‎04-12-2010

Thanks for the feedback. It is helpful to know this is not related to iSCSI as that has been the path I have been going down up to now. I definitely did not start seeing these issues until I had done some updates on the cluster. Hopefully we can get some answers....

stefanjansson · ‎04-13-2010

OK, if you hadn't seen these messages before you did the updates on your cluster (ESXi hosts), then my theory that this is something that has arisen with newer builds of ESXi still holds. I have logged a case today, so hopefully there will be some light shed on this matter. Have you logged a case yet?

Since I haven't seen any logs on a VM stating that it has problems with disk access, I'm hoping that this is not affecting operations yet...

If you click on "Ask VMware" in the events window, you'll get redirected to a VMware page that explains some of the impact. I quote:

"All I/O, metadata operations to the specific volume from COS, user interface (vSphere Client), or virtual machines are internally queued and retried for some duration of time. If the volume or storage device connectivity is not restored within that duration of time, such I/O operations fail. This might have an impact on already running virtual machines as well as any new power on operations by virtual machines, and so on."

So hopefully all the I/O is queued and is completed once access is restored....

Do most of your alarms also fall on regular clock times, like 11.00, 12.15, 13.30?

regards Stefan

edawg · ‎04-13-2010

I opened a case, but at the time they directed me to my storage vendor. In my situation the events do not typically occur at any set time. I will reopen my case. Make sure you let me know if you find anything out. I will also post back with what I find.

thanks

edawg · ‎04-13-2010

Just noticed something on my ESXi hosts that I was wondering if you could check. If you log into the host and browse to /var/log/vmware, open the hostd.log file and see if you spot a ton of errors stating that the status changed to yellow on your datastores. On my hosts, every one of them was filling up this file every two hours with the exact same errors we are seeing on the vCenter side when the datastores drop for a few seconds. To fix the issue I followed the steps outlined in http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=101709... to change my heartbeat time to 30 seconds. After restarting the management agents, the yellow errors have completely disappeared (so far). While this may not be the cause, I find it interesting that it is the exact same error I am seeing when the problem shows up. Let me know what you find.
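For anyone who wants to put a number on "a ton of errors", here is a minimal Python sketch that counts matching lines per hour. It is only a sketch under assumptions: the timestamp prefix pattern, the `count_matches_per_hour` helper, and the sample lines are all hypothetical and are not taken from a real hostd.log, so adjust the regular expression to whatever your own log lines look like.

```python
import re
from collections import Counter

# ASSUMPTION: each log line starts with "[YYYY-MM-DD HH:MM:SS".
# Adjust this pattern to match the lines in your own hostd.log.
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2}) (\d{2}):\d{2}:\d{2}")


def count_matches_per_hour(lines, needle="yellow"):
    """Count lines containing `needle` (case-insensitive), keyed by (date, hour)."""
    counts = Counter()
    wanted = needle.lower()
    for line in lines:
        if wanted not in line.lower():
            continue
        match = TIMESTAMP.match(line)
        if match:  # lines without a recognizable timestamp are skipped
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Hypothetical sample lines, made up for illustration only.
sample = [
    "[2010-04-13 10:02:11.000 info] datastore status changed to yellow",
    "[2010-04-13 10:47:30.000 info] datastore status changed to yellow",
    "[2010-04-13 11:05:00.000 info] unrelated message",
    "[2010-04-13 12:15:09.000 info] datastore status changed to Yellow",
]

for (day, hour), n in sorted(count_matches_per_hour(sample).items()):
    print(f"{day} {hour:02d}:00  {n} matching line(s)")
```

To run it against a real file, pass an open file object instead of `sample`, for example `count_matches_per_hour(open("hostd.log"))` on a copy of the log. Comparing the hourly counts before and after a change is a rough way to see whether the errors really dropped off.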

stefanjansson · ‎04-14-2010

I've checked the hostd logs, and I also have these errors filling up the log. But I also have them on the other cluster, which has no datastore connectivity errors. I believe this comes from the OS heartbeat monitoring of the VMs through VMware Tools. I have not activated that feature in my cluster; have you? You can see the setting in vCenter under "Cluster settings\VMware HA\VM Monitoring".

Did you enter both lines in the config.xml file?

vmsvc>

After that I enabled it in the cluster. But I'm not sure this line reflects this feature.....

I have sent some log files to support, but the only thing I have heard is that the case is being escalated to a higher level in the support chain...

regards Stefan

edawg · ‎04-14-2010

VM monitoring in HA was turned on already when I inherited the cluster, but I wasn't looking at the logs until recently to tell if the errors are new. I only entered the following line in the file...

Once I restarted the management agents on the hosts, the errors almost disappeared from the config.xml file. Whether or not this has anything to do with the datastores dropping within vCenter, I don't know. If it's not related, I should see the datastore issue within the next day or so, as I have not had it since Monday.

Thank you for keeping me in the loop. I will let you know what happens. Keep me updated on your side.

Regards,

Erik

stefanjansson · ‎04-19-2010

Hi ,

After a few days of uploading log files to support, the answer came back that my HBA driver was outdated according to the support matrix for my storage. I'm not sure that this will fix my problems, but you know the drill: until the latest and greatest has been installed, there is no deep dive into the problem. I changed to that driver today. It took a while, since there is a lot of vMotion back and forth between ESX hosts, and the upgrade also required a reboot. I will get back with progress after the update.

regards Stefan

edawg · ‎04-19-2010

Definitely let me know. On my side the issue has not occurred since I added the time value to each host. At this point this is the longest I have gone without the issue re-surfacing, so I am hopeful it was the fix.

Good luck

stefanjansson · ‎04-23-2010

Hi ,

I had almost begun to believe that the upgrade of the HBA driver had fixed the problem for us, but last night we had another "issue". Almost 77 hours had passed since I changed the driver on the last ESXi host, and that is the longest time so far between failures since I became aware of this issue. It could also be down to the fact that all the ESXi hosts were rebooted.....

However, the case will continue. One question: is your cluster still OK after you changed that parameter in config.xml regarding heartbeat status?

I will point out this fact to support to see what they have to say about it...

Another question: how many datastores do you have on your cluster?

regards Stefan

edawg · ‎04-23-2010

Stefan-

Sorry to hear your issue is still happening. Since the change the cluster is working fine, with no issues that I can see. We are running 3 Dell R710s in the cluster, with 10 iSCSI datastores presented to the cluster and around 80 VMs. Prior to the change, I was seeing the exact same error I was getting in the cluster show up hundreds of times every few hours in config.xml. Since the change I only see a few entries every couple of days. Keep me in the loop.

Budrumi · ‎06-14-2010

Thanks for bumping into my thread, Stefan. Incidentally, I just found this thread earlier today.

Currently I can't do much more, because the servers and SAN already have the latest firmware. As I wrote, when I was rebooting one of the servers that was experiencing the problem, I also updated ESXi to the latest build, 256968, but it had no effect and the problem persists. I also can't try changing the heartbeat value, since these are standalone ESXi servers without vCenter.

Please keep this thread updated when you learn any new information, it'll be appreciated.

Budrumi · ‎06-14-2010

Btw last night, I've upgraded 2 of our servers to the U2 release (261974).

One of the servers (many VMs running) was exhibiting the issue before and is still exhibiting it after the upgrade to U2.

The second server (2 VMs running occasionally) wasn't exhibiting the issue before and still isn't after the upgrade.

Beats me....

stefanjansson · ‎06-14-2010

Hi, sorry to hear that. That probably means that my problems won't disappear either when I upgrade my cluster. When these issues occur on your host, do they fall on fairly regular clock times, like 22.18, 23.48, 03.18 and so on? In my cluster it looks like something that is probably scheduled every half hour. I would guess that your other host does not have enough workload to experience these issues. I have a test cluster with a few VMs on it, and I haven't seen these issues on that cluster...

regards Stefan

Budrumi · ‎06-15-2010

That was my line of thinking too .... all hosts with moderate to high load and many VMs exhibit the problem, while two servers with lower load and few VMs don't (one of them is on the pre-U1 release 181792, the other is the one I upgraded tonight to U2).

And about the time .... well it happens at semi-random times (23:30, 1:12, 6:30, 8:05, 8:30 - that's since the upgrade that took place yesterday around 22:30). On other server, it's at different times (18:00, 19:30, 19:45, 20:30, 22:10, 23:30, 6:04, 8:00 etc.).
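For anyone wanting to test the "fixed schedule" idea against their own event times, here is a minimal Python sketch that lists which times do not land on a given minute grid. The `minutes_since_midnight` and `off_grid` helpers are hypothetical names made up for this example, and the grid is assumed to start at midnight. The input times are the ones quoted in this thread, so they are only as precise as what was typed into the posts.

```python
def minutes_since_midnight(hhmm):
    """Parse 'HH:MM' or 'HH.MM' into minutes since midnight."""
    hours, minutes = hhmm.replace(".", ":").split(":")
    return int(hours) * 60 + int(minutes)


def off_grid(times, period_minutes):
    """Return the times that are NOT a multiple of period_minutes after midnight."""
    return [t for t in times if minutes_since_midnight(t) % period_minutes != 0]


# Times as quoted in the posts above.
stefan = ["18.00", "11.30", "5.50", "22.00", "13.15"]
budrumi_first = ["23:30", "1:12", "6:30", "8:05", "8:30"]
budrumi_other = ["18:00", "19:30", "19:45", "20:30", "22:10", "23:30", "6:04", "8:00"]

print(off_grid(stefan, 5))          # []
print(off_grid(budrumi_first, 5))   # ['1:12']
print(off_grid(budrumi_other, 5))   # ['6:04']
```

A handful of hand-copied times is weak evidence either way, so treat the result as a hint about where to look rather than proof of a scheduled job.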

SurfControl · ‎08-09-2010

Anything new on this, folks? I'm having the same issue. I have ESX 4.0 U1, vCenter 4.0 U1, and a Clariion CX-4 over FC.

stefanjansson · ‎08-09-2010

I'm on vacation and will be back on August 16.

Regards

// Stefan

Budrumi · ‎08-09-2010

Hi.

Not really. The problem is still there, but it's almost unnoticeable thanks to working failover. I'm not sure if I mentioned it in my previous responses, but the issue didn't go away even with build 261974.

Unfortunately, I didn't have a big enough maintenance window (no vMotion here) to upgrade to ESXi 4.1 and check whether the problem still persists.

edawg · ‎08-09-2010

Sounds like you guys may have a different issue, but I wanted to let you know that the fix found in http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1017091 has worked for me since the day I put it in place. No repeats. Good luck.