Re: why?: Lost access to volume due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

ManiacMark · ‎12-02-2015

I recently did a fresh install of ESXi 6.0.0 2494585 essentials on my HP DL380G7 using the HP image.

I am using LOCAL SAS drives as my Raid5 datastore.

Everything appears to be configured as normal.

With no guests installed I was seeing this event message appear many times throughout the day. I set up one guest VM (Windows Server 2008 R2) and I still see these "lost access" events occurring throughout the day.

The exact events are:

Lost access to volume 56248aa8-e7b72dd9-14fa-d48564790096 (DataStore1Raid5) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. 2015-12-02 11:44:37 PM

and then I immediately get a success

Successfully restored access to volume 56248aa8-e7b72dd9-14fa-d48564790096 (DataStore1Raid5) following connectivity issues. 2015-12-02 11:44:37 PM
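For anyone who wants to measure how long each outage actually lasts, here is a minimal sketch that pairs the two event messages above and computes the gap between them. It assumes the events have been copied out as plain text in exactly the form quoted in this post; the names `EVENT_RE` and `parse_event` are made up for illustration and are not part of any VMware tool.

```python
import re
from datetime import datetime

# Pattern for the two event messages quoted in this post.
# Assumption: the timestamp sits at the end of the line as "YYYY-MM-DD H:MM:SS AM/PM".
EVENT_RE = re.compile(
    r"(?P<kind>Lost access to|Successfully restored access to) volume "
    r"(?P<uuid>[0-9a-f-]+) \((?P<name>[^)]+)\).*?"
    r"(?P<when>\d{4}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2} [AP]M)$"
)

def parse_event(line):
    """Return (kind, volume_uuid, datastore_name, timestamp), or None if no match."""
    match = EVENT_RE.search(line.strip())
    if match is None:
        return None
    when = datetime.strptime(match.group("when"), "%Y-%m-%d %I:%M:%S %p")
    kind = "lost" if match.group("kind").startswith("Lost") else "restored"
    return kind, match.group("uuid"), match.group("name"), when

# The two events exactly as quoted above.
LOST = ("Lost access to volume 56248aa8-e7b72dd9-14fa-d48564790096 (DataStore1Raid5) "
        "due to connectivity issues. Recovery attempt is in progress and outcome will "
        "be reported shortly. 2015-12-02 11:44:37 PM")
RESTORED = ("Successfully restored access to volume 56248aa8-e7b72dd9-14fa-d48564790096 "
            "(DataStore1Raid5) following connectivity issues. 2015-12-02 11:44:37 PM")

lost_event = parse_event(LOST)
restored_event = parse_event(RESTORED)

# Time between losing and regaining access, at the one-second resolution of the event log.
outage_seconds = (restored_event[3] - lost_event[3]).total_seconds()
print(lost_event[2], "outage:", outage_seconds, "seconds")
```

With the two events quoted here the difference is zero seconds, which matches the "immediately get a success" observation. The event log only has one-second resolution, so the real interruption could be anything under a second.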

The RAID is made up of 8 local SAS drives on a P410i controller with a 1GB battery/capacitor-backed cache card. Health status indicates all drives in the RAID are healthy. The entire system is showing as healthy.

ESXi is running from attached USB stick.

Scratch folder is located on local raid datastore.

Does anyone have any suggestions on how to resolve or troubleshoot this? Or which logs may point to anything to help solve this? Maybe just a bug in ESXi 6.0.0?

Thanks.

pterlisten · ‎12-02-2015

Hi,

Do you see any issues, such as hanging VM guests, SCSI disk errors in the guest event log, or high latencies? If not, I suggest updating the drivers (update to 6.0 U1 with an HP customized image) and checking whether the error occurs again.

ManiacMark · ‎12-02-2015

The issue existed before I installed any guests. I watched it for a few days (with no guests) and the event error appeared periodically.

I then installed a guest as a test, and the guest does hang for a second when the event occurs.

When it first occurred I thought it was related to a bad disk in the newly built array I put together. I replaced the disk, the array rebuilt, and there are no other bad disks right now. Yet the error continues.

I guess if there are no other options I will move to 6.0 Update 1, which was released recently, as soon as I can.

pterlisten · ‎12-02-2015

Hi,

Try updating to 6.0 U1. Make sure that you use an HP customized ISO. If the error isn't resolved with this update, you should open a support case. This isn't normal behaviour.

cjscol · ‎12-03-2015

Do you have a spare drive configured? I have seen this problem on HP servers with P410i controller when a spare drive is configured. Try removing the spare drive and see if that cures the problem.

Calvin Scoltock VCP 2.5, 3.5, 4, 5 & 6 VCAP5-DCD VCAP5-DCA http://pelicanohintsandtips.wordpress.com/blog LinkedIn: https://www.linkedin.com/in/cscoltock

FritzBrause · ‎12-03-2015

The November 6.0 U1 HP image contains the hpsa driver 6.0.0.114.

So as already said before, U1 should improve this.

HawkieMan · ‎12-03-2015

OK, I must be on the simple side here, but check the cable from the controller to the disk backplane. Also check the battery on the controller.

ManiacMark · ‎12-03-2015

I have just updated to 6.0 update1 (build 3073146) using the HP image so we'll see how it goes. I will update this thread.

It looks promising: before the update, my Performance Chart for Disk showed non-stop activity even though I had no guest VMs running. Now I only see minimal activity on the Disk chart. Interesting.

Yes, I do have a spare drive configured in the Raid5 config of the p410i. If the update above doesn't solve the issue, my next action will be to remove the spare as some have suggested.

ManiacMark · ‎12-03-2015

I completed the upgrade to 6.0 Update 1 at around 6:15pm tonight.

I'm checking vSphere event log now and I still see the error appearing at 7:20pm, 8:20pm, 9:35pm, 10:50pm, 11:50pm, 1:26am, 2:26am.

I get the "Lost access to volume" and then immediately get "Successfully restored access to volume" within the same second.

It appears to happen every hour (but sometimes a little more than an hour). This leads me to believe it's related to some kind of heartbeat for the storage array.
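To check the "roughly every hour" impression, here is a small sketch that computes the gaps between the event times listed above. The only input is the list of times from this post; the helper name `gaps_in_minutes` is just for illustration, and it assumes the times are in order and that a time earlier than the previous one means the clock passed midnight.

```python
from datetime import datetime, timedelta

# Event times copied from this post (the upgrade finished around 6:15pm).
EVENT_TIMES = ["7:20pm", "8:20pm", "9:35pm", "10:50pm", "11:50pm", "1:26am", "2:26am"]

def gaps_in_minutes(times):
    """Return minutes between consecutive clock times, rolling over at midnight."""
    stamps = []
    day_offset = timedelta(0)
    previous = None
    for text in times:
        clock = datetime.strptime(text.upper(), "%I:%M%p")
        # A clock time earlier than the previous one means we crossed midnight.
        if previous is not None and clock < previous:
            day_offset += timedelta(days=1)
        previous = clock
        stamps.append(clock + day_offset)
    return [int((b - a).total_seconds() // 60) for a, b in zip(stamps, stamps[1:])]

gaps = gaps_in_minutes(EVENT_TIMES)
print(gaps)  # [60, 75, 75, 60, 96, 60]
```

Every gap is at least 60 minutes, and three of the six are exactly 60. That fits a periodic check that fires about once an hour and is sometimes delayed, but it does not identify what that check is.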

The VMware knowledge base is down right now so I can't search anything on this topic, but if I highlight the event and click the "Ask VMware" button it tries to search for an article related to esx.problem.vmfs.heartbeat.timedout

ManiacMark · ‎12-04-2015

Since I only have a VMware Essentials license I can't seem to open a support ticket with them.

I haven't had a chance to remove the Spare Drive in the raid to try that approach yet.

I did find this KB about the vmfs heartbeat in ESXi 6.0, so I just gave this a try to disable it: Enabling or disabling VAAI ATS heartbeat (2113956)

We'll see what happens I guess....

CollinChaffin · ‎12-04-2015

See my post below; talk about horrible issues. Try the latest build and see whether, as with the iSCSI issues we are seeing, those "APIs" also mysteriously fix your issues, or whether the driver has been updated.

Re: **WARNING - iSCSI volumes WILL FAIL using build 3073146 other NETWORK FAILURES!!!!**

jcosta · ‎01-03-2016

Did you ever resolve this? I have ESXi 6 build 3247720 with RAID 10 and one spare on a new Dell R730, and every time I start a VM on the server the entire server loses the local volume. One minute later it reconnects and everything starts to work properly.

I haven't tried updating the firmware on the Dell server yet.

Thanks,

J

ManiacMark · ‎01-03-2016

I have not tried it, but the best option at this point seems to be to remove the hot spare from the raid config.

See this thread:

Datastore / Disk latency problems with HP ProLiant G7 - HP Smart Array P410i controller " WARNING: L...

jcosta · ‎01-04-2016

OK, I will try that on my Dell box. It is so odd... I can make it happen every time I restart my Windows 7 or Windows 8 VDIs.

I am also going to update the firmware and see if that fixes it.

It doesn't happen when I restart the server VDIs, only the desktops. If I move the desktops to my Synology NAS it works fine and the host never loses the LOCAL storage.

So weird!

Sreejesh_D · ‎01-04-2016

Can you see any events in the vmkernel log similar to the following?

ATS Miscompare detected beween test and set HB images at offset XXX on vol YYY

Below is an article on the heartbeat issue with the error message mentioned in this thread. Though the article is about IBM storage, it's worth a read.

http://www.thevirtualist.org/alert-application-outages-using-vaai-ats-on-vsphere-5-5-update2-vsphere...
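If scrolling through the log by hand is tedious, here is a minimal sketch that scans log lines for the message quoted above. The sample lines are hypothetical and made up for illustration; only the "ATS Miscompare detected" text comes from this post, and the function name `find_ats_miscompares` is my own.

```python
import re

# Matches the message quoted in this post; case-insensitive in case builds differ.
ATS_RE = re.compile(r"ATS Miscompare detected", re.IGNORECASE)

def find_ats_miscompares(lines):
    """Return (line_number, text) for every line that mentions an ATS miscompare."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if ATS_RE.search(line)
    ]

# Hypothetical sample lines, not taken from a real host.
SAMPLE = [
    "hypothetical line: unrelated storage message",
    "hypothetical line: ATS Miscompare detected beween test and set HB images at offset XXX on vol YYY",
    "hypothetical line: another unrelated message",
]

hits = find_ats_miscompares(SAMPLE)
for number, text in hits:
    print(number, text)
```

To run it against a real log, pass an open file object instead of `SAMPLE`. On ESXi the vmkernel log is normally /var/log/vmkernel.log, but treat that path as an assumption and check your own host.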

jcosta · ‎01-04-2016

Nope! No error messages. Just did a support call with VMware and they pushed me to Dell because they didn't see anything in the logs.

I am going to do a full firmware update tomorrow and see if that fixes it, and if it doesn't I will remove the spare.

Here is to typical troubleshooting!

Later, J

jcosta · ‎01-04-2016

I went to the client and removed the hot spare and upgraded the entire server to the latest firmware. Same thing!

I have one Win 7 box with a bunch of snapshots... it is my base for the virtual desktops. When I start that, the host always loses the connection to the LOCAL LUN.

I cloned it and it fires up no problem without disconnecting the LUN. No clue!

Going to contact Dell but I have a feeling it is the Raid 10. If they can't figure it out then I am going to switch it to a Raid 5 and see if it still happens.

oh2ftu · ‎01-17-2016

I'm seeing the same issue with ESXi 6.0 U1 (HP) on my HP DL380G8.

ESXi is installed on a USB stick, the RAID is a 3+3+3 RAID 50. I also have a 3x300GB RAID 5.

Both arrays report lost access if any hot spare has been assigned, either array-specific or global.

Spare assignment has been done with hpssacli.

The P410i also has 512MB of cache with BBWC. Naturally, there is a 2nd SFF cage installed with the needed SAS expander.

No Smart array advanced pack. All firmwares have been updated with SPP 2015.10.

One time, I even got an "All paths down" when moving data from one array to another.

Any updates on this?

ManiacMark · ‎01-20-2016

I just wanted to confirm that after I removed the Spare drive from the array configuration, the warnings stopped appearing.

Looks like the latest HP driver still has problems when the spare drive is configured in an array.

edmandsj · ‎01-20-2016

Has that spare drive ever been in the RAID as an active drive? Maybe old metadata still exists on it, and clearing that data might help with this. Do you have a different spare to drop in?
