VMware Cloud Community
muser12
Contributor

lost access to volume

I am running ESXi on 3 different machines at high load (CPU + disk), and I am encountering the following error event:

Lost access to volume GUID (XXX) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. 3/21/2015 3:48:57 PM

Shortly thereafter, access is restored:

Successfully restored access to volume GUID (XXX) following connectivity issues. 3/21/2015 3:49:24 PM

Interestingly, these events occur exactly every 6 hours on each affected machine:

2015-03-21T13:39:01.303Z cpu30:32857)HBX: 270: Reclaimed heartbeat for volume GUID (XXX): [Timeout] Offset 3796992

2015-03-21T19:39:13.824Z cpu20:32855)HBX: 270: Reclaimed heartbeat for volume GUID (XXX): [Timeout] Offset 3796992

2015-03-22T01:39:09.569Z cpu0:32856)HBX: 270: Reclaimed heartbeat for volume GUID (XXX): [Timeout] Offset 3796992

Most of the search results (and the VMware KB) discuss issues related to FC/iSCSI/network connected datastores. However, these datastores are local disks connected to a MegaRAID SAS controller.

The fact that this occurs every 6 hours made me think there was some sort of cronjob or the like that was running and causing a whole bunch of disk churn, which, combined with an already high disk load, was clogging up the controller. However, I can't find any such cronjob in /var/spool/cron/crontabs/root. I've checked several logs in /var/log, and nothing is jumping out anywhere around those time frames. I've updated to the latest ESXi patches, but that didn't help. Any ideas?
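In case it helps anyone reproduce the checks, this is roughly what I ran from the ESXi shell (assuming the default log locations, where /var/log/vmkernel.log is the live vmkernel log):

# list the jobs ESXi schedules as root
cat /var/spool/cron/crontabs/root

# pull the heartbeat reclaim messages, then look for anything else firing at the same times
grep HBX /var/log/vmkernel.log
grep "2015-03-21T13:3" /var/log/vmkernel.log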

FWIW, my hw/sw is:

ESXi 5.5.0, build 2456374

SuperMicro X10-DRHC

MegaRaid SAS Invader Controller (LSI 3108 SAS3)

3 consumer SSD drives in RAID0 (for the affected datastore)

J1mbo
Virtuoso

Cannoli - did you ever find a solution to the lsi_mr3: fusionWaitForOutstanding and 30-40s IO pause? We are seeing this with the current driver (6.903.85.00-1OEM.600.0.0.2768847) when the array is under some load.
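For anyone comparing builds, the loaded driver can be checked from the ESXi shell; a quick sketch (the VIB is usually named lsi-mr3, but OEM bundles may name it differently):

# show the installed lsi-mr3 VIB and its version
esxcli software vib list | grep -i mr3

# show details of the loaded kernel module
esxcli system module get -m lsi_mr3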

dragoangel
Contributor

Have you found a solution? I have the same situation: home lab, Supermicro with an LSI 3108 SAS3 Invader RAID. Access to the volume is lost due to connectivity issues for 1-3 seconds, and then it is restored.

Second thing: in the ESXi monitoring I see a log entry saying that recovery of one of the HDDs in the RAID was aborted, but in the Supermicro IPMI storage monitoring everything looks fine and the array is in the "Optimal" state.

m4ntic0r
Enthusiast

I had a similar issue on my Supermicro X10SDV-TLN4F board with an LSI 9271-8i controller, with 2x 500GB Samsung 850 Evo in RAID1 and 4x 4TB HGST Megascale in RAID10.

I got the following messages when my RAID was under load:

Successfully restored access to volume 599fdec9-e7062b2a-3755-ac1f6b1708cc (S1-RAID1) following connectivity issues.

Successfully restored access to volume 599fdeb2-5c0475e0-dbb7-ac1f6b1708cc (S2-RAID10) following connectivity issues.

Lost access to volume 599fdec9-e7062b2a-3755-ac1f6b1708cc (S1-RAID1) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

Lost access to volume 599fdeb2-5c0475e0-dbb7-ac1f6b1708cc (S2-RAID10) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

But then I changed my RAID config for the SSD RAID1 from write-through to write-back (I had write-through because of my previous Windows installation and the LSI FastPath settings), and now I no longer get these errors, even under load. I also added one 512GB SSD as a CacheCade 2.0 read cache for my RAID10.
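If it helps, the cache policy can also be checked and changed from the ESXi shell with StorCLI instead of rebooting into the controller BIOS. A rough sketch, assuming controller /c0, virtual drive /v0 and StorCLI installed under /opt/lsi/storcli (adjust for your install):

# show the current cache settings of the virtual drive
/opt/lsi/storcli/storcli /c0/v0 show all

# switch the write policy from write-through to write-back (WB falls back to WT if the BBU fails, AWB does not)
/opt/lsi/storcli/storcli /c0/v0 set wrcache=wb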

What settings are you using?

dragoangel
Contributor

Thanks for the suggestion, I'll check on Monday.

dragoangel
Contributor

About write-back: it was already set to that mode.

I do not have any cache, because I bought SSDs like yours, even the same model, but they died after half a year.

I already updated the RAID controller firmware to the latest version from the Supermicro FTP (4.650.00-6223, ftp://ftp.supermicro.com/Driver/SAS/LSI/3108). It works better now, but there are still errors.

I also installed the LSI Provider from the management tools, which is good for seeing the RAID status, but MegaRAID Storage Manager on Windows can find the ESXi host maybe 1 time out of 20 =D. Because of that, I installed StorCLI.
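For reference, this is roughly how I put StorCLI on the host and check the array (the VIB file name below is just an example from the package I downloaded, yours may differ):

# install the StorCLI VIB (needs the full path to the file)
esxcli software vib install -v /tmp/vmware-esx-storcli.vib --no-sig-check

# controller and virtual drive summary
/opt/lsi/storcli/storcli /c0 show

# state of every physical drive
/opt/lsi/storcli/storcli /c0/eall/sall show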
