VMware Cloud Community
dmetcalfe92
Enthusiast

All disks in one host dropped from vSAN

Hi,

I recently had a power cut at home. My 3-node cluster came back okay, but I noticed vSAN was missing a lot of storage.

After investigating, it looks like all the disks in one host have been dropped (all 4 were functional before the incident).

I have tried rebooting the host but this hasn't resolved the issues I'm experiencing.

My file server is unable to power on because of this; I think it was resyncing components when the power cut happened.

Because I have data on these disks (my file server), I'm hesitant to re-install ESXi on the host or do anything with the disks.

I am also due to upgrade my on-disk format version to 3, but I'm hesitant to do that at the moment as well, because of my data!

Environment:

- vCenter Server Appliance 6.0 U2 (build 3634794)

- ESXi 6.0 Update 2 (build 3620759) on all hosts

Please could someone advise on what steps I should perform next?

See below for some screenshots and information.

Thanks

The Metadata health check informs me that all the disks have failed; however, all 4 disks were functioning as expected prior to the power cut:

Metadata health.png

Looking at the "Virtual SAN Health Status" column, I can see it doesn't show any health information for the disks:

g8esxi1.png

Comparison to another host:

g8esxi3.png

The host CAN see the disks on its HBAs, see below:

g8esxi1 - storage adapters.png
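For reference, the adapters and devices can also be checked from the CLI; these are the standard storage commands (run over SSH on the host), so take them as a general pointer rather than anything vSAN-specific:

esxcli storage core adapter list

esxcli storage core device list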

11 Replies
elerium
Hot Shot

It looks like your setup is using consumer desktop SSDs, which generally don't have power loss protection for writes. It's quite possible the power loss caused data loss on the SSDs, where the SSDs acked the writes but lost the data during the power cut. A failure of the SSD would cause the magnetic disks behind it to fail as well.

Is there any resync activity occurring after power-up? If you are able to get into RVC, what do vsan.check_state and vsan.disks_stats report back? Also, I'm assuming you are using FTT=1? If so, I'm kind of surprised the mirror copy isn't working, but as you mentioned, the power loss happened during a resync, where a mirror may not have been complete.
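If it helps, these are roughly the RVC commands I mean (the cluster path is just a placeholder; substitute your own vCenter/datacenter/computers/cluster path). vsan.resync_dashboard should show any resync activity:

vsan.check_state <pathToCluster>

vsan.disks_stats <pathToCluster>

vsan.resync_dashboard <pathToCluster>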

dmetcalfe92
Enthusiast

Hi Elerium,

Thanks for your response. I'd say everything is consumer grade except the HP Microservers & network switch.

Yes all vSAN Storage Policies are set to FTT=1

Outputs from the RVC commands below.

> vsan.check_state /192.168.168.9/Home/computers/G8

2016-03-31 12:42:13 +0000: Step 1: Check for inaccessible VSAN objects

Detected 1 objects to be inaccessible

Detected 261ff456-da9b-5df6-bfc2-002655e3ce94 on g8esxi2.d.l to be inaccessible

2016-03-31 12:42:13 +0000: Step 2: Check for invalid/inaccessible VMs

2016-03-31 12:42:13 +0000: Step 3: Check for VMs for which VC/hostd/vmx are out of sync

Did not find VMs for which VC/hostd/vmx are out of sync

> vsan.disks_stats /192.168.168.9/Home/computers/G8

2016-03-31 12:45:45 +0000: Fetching VSAN disk info from g8esxi3.d.l (may take a moment) ...

2016-03-31 12:45:45 +0000: Fetching VSAN disk info from g8esxi2.d.l (may take a moment) ...

2016-03-31 12:45:45 +0000: Fetching VSAN disk info from g8esxi1.d.l (may take a moment) ...

2016-03-31 12:45:47 +0000: Done fetching VSAN disk infos

+----------------------------------------------------------------------------+-------------+-------+------+------------+---------+----------+---------+

|                                                                            |             |       | Num  | Capacity   |         |          | Status  |

| DisplayName                                                                | Host        | isSSD | Comp | Total      | Used    | Reserved | Health  |

+----------------------------------------------------------------------------+-------------+-------+------+------------+---------+----------+---------+

| t10.ATA_____KINGSTON_SV300S37A60G___________________50026B7741053681____   | g8esxi1.d.l | SSD   | 0    | 55.90 GB   | 0.00 %  | 0.00 %   | OK (v2) |

+----------------------------------------------------------------------------+-------------+-------+------+------------+---------+----------+---------+

| t10.ATA_____KINGSTON_SV300S37A60G___________________50026B7745072574____   | g8esxi2.d.l | SSD   | 0    | 55.90 GB   | 0.00 %  | 0.00 %   | OK (v2) |

| t10.ATA_____ST31500341AS________________________________________9VS2XQTT   | g8esxi2.d.l | MD    | 35   | 1383.29 GB | 88.35 % | 87.90 %  | OK (v2) |

| t10.ATA_____ST500LT0122D1DG142___________________________________S3P6SP0P  | g8esxi2.d.l | MD    | 33   | 461.09 GB  | 90.17 % | 89.64 %  | OK (v2) |

| t10.ATA_____WDC_WD2003FYYS2D02W0B0________________________WD2DWMAY03249303 | g8esxi2.d.l | MD    | 34   | 1844.38 GB | 88.49 % | 87.92 %  | OK (v2) |

+----------------------------------------------------------------------------+-------------+-------+------+------------+---------+----------+---------+

| t10.ATA_____M42DCT512M4SSD2__________________________00000000113103155A49  | g8esxi3.d.l | SSD   | 0    | 476.94 GB  | 0.00 %  | 0.00 %   | OK (v2) |

| t10.ATA_____ST31500341AS________________________________________9VS2WS1C   | g8esxi3.d.l | MD    | 37   | 1383.29 GB | 98.35 % | 95.03 %  | OK (v2) |

| t10.ATA_____HGST_HDN724040ALE640__________________________PK2334PBH1N87R   | g8esxi3.d.l | MD    | 51   | 3688.75 GB | 90.28 % | 85.29 %  | OK (v2) |

| t10.ATA_____HGST_HUS724020ALA640__________________________PN2134P5GB31HX   | g8esxi3.d.l | MD    | 37   | 1844.38 GB | 96.49 % | 93.22 %  | OK (v2) |

+----------------------------------------------------------------------------+-------------+-------+------+------------+---------+----------+---------+

> vsan.disks_stats /192.168.168.9/Home/computers/G8/hosts/g8esxi1.d.l

2016-03-31 12:47:18 +0000: Fetching VSAN disk info from g8esxi1.d.l (may take a moment) ...

2016-03-31 12:47:19 +0000: Done fetching VSAN disk infos

+--------------------------------------------------------------------------+-------------+-------+------+----------+--------+----------+---------+

|                                                                          |             |       | Num  | Capacity |        |          | Status  |

| DisplayName                                                              | Host        | isSSD | Comp | Total    | Used   | Reserved | Health  |

+--------------------------------------------------------------------------+-------------+-------+------+----------+--------+----------+---------+

| t10.ATA_____KINGSTON_SV300S37A60G___________________50026B7741053681____ | g8esxi1.d.l | SSD   | 0    | 55.90 GB | 0.00 % | 0.00 %   | OK (v2) |

+--------------------------------------------------------------------------+-------------+-------+------+----------+--------+----------+---------+

> vsan.check_state /192.168.168.9/Home/computers/G8/hosts/g8esxi1.d.l

2016-03-31 12:47:33 +0000: Step 1: Check for inaccessible VSAN objects

Detected 1 objects to be inaccessible

Detected 261ff456-da9b-5df6-bfc2-002655e3ce94 on  to be inaccessible

2016-03-31 12:47:34 +0000: Step 2: Check for invalid/inaccessible VMs

2016-03-31 12:47:34 +0000: Step 3: Check for VMs for which VC/hostd/vmx are out of sync

Did not find VMs for which VC/hostd/vmx are out of sync

arvinnd
VMware Employee

In the RVC output, host g8esxi1.d.l is not listing any magnetic devices. I think you should investigate whether that host has a disk failure. We need to bring the 3 MDs on host g8esxi1.d.l back.
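As a quick check, it may also be worth looking at what vSAN itself has claimed on that host. I'd expect the standard command below (run over SSH on g8esxi1) to list the SSD and the three magnetic disks along with their disk group membership:

esxcli vsan storage list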

dmetcalfe92
Enthusiast

Hello,

Yes, it's a strange one. No disks are being reported as failed.

I suspect either faulty configuration on the ESXi host, or faulty SSD.

I have read on the internet that if my SSD is faulty and needs replacing, the disk group will have to be deleted and recreated.

Is this correct?

I'm currently reinstalling ESXi on the host to see if this alleviates the issues.

CHogan
VMware Employee

Could it be this feature of VSAN?

VSAN 6.1 New Feature - Handling of Problematic Disks - CormacHogan.com

http://cormachogan.com
dmetcalfe92
Enthusiast

Thanks CHogan,

I will run the below on all hosts, keeping monitoring on but disabling the option to unmount.

esxcli system settings advanced set -o /LSOM/lsomSlowDeviceUnmount -i 0
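Before changing it, I'll note down the current value so I can revert it later; assuming the usual advanced settings syntax, that should just be:

esxcli system settings advanced list -o /LSOM/lsomSlowDeviceUnmount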

I'll let you know the outcome once I've done this and rebooted the faulty host

Thanks

dmetcalfe92
Enthusiast

Hi Guys,

Thanks for all your help so far.

It looks like there may be a bug somewhere in Update 2 (I'm unsure whether the bug resides in VCSA or ESXi).

I patiently watched the host as it booted.

The SSD initialization took about 20 minutes, then reported the error below as the host moved on to starting other services.

I suspect there's a bug in VCSA where it's not showing the true status of the SSD!

SSD FAIL.png

The reason I missed this is that the host takes around 20 minutes to boot, and the error is only shown for 15-20 seconds while the rest of the host boots.

I've gathered all the logs from /var/log; however, I'm unsure which log file will tell me more information.
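For what it's worth, I believe the LSOM/PLOG messages from boot end up in /var/log/vmkernel.log, so something like the below (plain busybox grep on the host) should pull out the relevant lines:

grep -i lsom /var/log/vmkernel.log

grep -i plog /var/log/vmkernel.log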

Can anyone advise?

Thanks

dmetcalfe92
Enthusiast

Possibly 2 bugs actually:

-One where VCSA doesn't report the disk group as failed,

-Another in ESXi, a possible memory leak.

See the logs below; I have highlighted the points that look interesting.

2016-04-01T12:03:41.404Z cpu0:33033)PLOG: PLOGAnnounceSSD:6570: Successfully added VSAN SSD (t10.ATA_____KINGSTON_SV300S37A60G___________________50026B7741053681____:2) with UUID 5250634d-bf04-c271-ed11-19f6899b3131

2016-04-01T12:03:41.404Z cpu0:33033)VSAN: Initializing SSD: 5250634d-bf04-c271-ed11-19f6899b3131 Please wait...

2016-04-01T12:03:41.404Z cpu1:33146)PLOG: PLOGNotifyDisks:4010: MD 0 with UUID 52c1ab88-7ddc-3453-4e7a-008c68268c3a with state 0 formatVersion 2 backing SSD 5250634d-bf04-c271-ed11-19f6899b3131 notified

2016-04-01T12:03:41.404Z cpu1:33146)PLOG: PLOGNotifyDisks:4010: MD 1 with UUID 52a7e4ac-c6b4-c354-0210-0cf8d552f06f with state 0 formatVersion 2 backing SSD 5250634d-bf04-c271-ed11-19f6899b3131 notified

2016-04-01T12:03:41.404Z cpu1:33146)PLOG: PLOGNotifyDisks:4010: MD 2 with UUID 52562af0-4663-c81d-8daa-48a4b1253c0e with state 0 formatVersion 2 backing SSD 5250634d-bf04-c271-ed11-19f6899b3131 notified

2016-04-01T12:03:41.404Z cpu1:33146)VSANServer: VSANServer_InstantiateServer:2885: Instantiated VSANServer 0x4304680b5b18

2016-04-01T12:03:41.405Z cpu1:33284)Created VSAN Slab RcSsdParentsSlab_0x4306f950de00 (objSize=208 align=64 minObj=2500 maxObj=25000 overheadObj=0 minMemUsage=668k maxMemUsage=6668k)

2016-04-01T12:03:41.406Z cpu1:33284)Created VSAN Slab RcSsdIoSlab_0x4306f950de00 (objSize=65536 align=64 minObj=64 maxObj=25000 overheadObj=0 minMemUsage=4352k maxMemUsage=1700000k)

2016-04-01T12:03:41.406Z cpu1:33284)Created VSAN Slab RcSsdMdBElemSlab_0x4306f950de00 (objSize=32 align=64 minObj=4 maxObj=4096 overheadObj=0 minMemUsage=4k maxMemUsage=264k)

2016-04-01T12:03:41.406Z cpu1:33284)Created VSAN Slab RCInvBmapSlab_0x4306f950de00 (objSize=56 align=64 minObj=10 maxObj=44073 overheadObj=0 minMemUsage=4k maxMemUsage=2800k)

2016-04-01T12:03:41.418Z cpu0:33000)Global: Virsto_CreateInstance:83: INFO: Create new Virsto instance (heapName: virstoInstance_00000000)

2016-04-01T12:03:41.431Z cpu0:33000)Global: Virsto_CreateInstance:83: INFO: Create new Virsto instance (heapName: virstoInstance_00000001)

2016-04-01T12:03:41.444Z cpu0:33000)Global: Virsto_CreateInstance:83: INFO: Create new Virsto instance (heapName: virstoInstance_00000002)

2016-04-01T12:03:41.445Z cpu0:33000)DOM: DOMDisk_GetServer:256: disk-group w/ SSD 5250634d-bf04-c271-ed11-19f6899b3131 on dom/comp server 0

2016-04-01T12:03:41.460Z cpu0:33213)LSOMCommon: SSDLOGLogEnumProgress:1107: Estimated time for recovering 2124854 log blks is 322962 ms device: t10.ATA_____KINGSTON_SV300S37A60G___________________50026B7741053681____:2

2016-04-01T12:04:08.620Z cpu0:33113)NMP: nmp_ResetDeviceLogThrottling:3349: last error status from device mpx.vmhba32:C0:T0:L0 repeated 5 times

2016-04-01T12:20:50.611Z cpu1:33213)WARNING: Heap: 3583: Heap LSOM already at its maximum size. Cannot expand.

2016-04-01T12:20:50.611Z cpu1:33213)WARNING: Heap: 4214: Heap_Align(LSOM, 216/216 bytes, 8 align) failed.  caller: 0x41801d13ef1b

2016-04-01T12:20:50.611Z cpu1:33213)LSOM: LSOMSSDEnumCb:214: Finished reading SSD Log: Out of memory

2016-04-01T12:20:50.611Z cpu1:33213)LSOMCommon: SSDLOG_EnumLogHelper:1336: Throttled: Waiting for 1 outstanding reads

2016-04-01T12:20:51.004Z cpu1:33213)LSOM: LSOMRecoveryDispatch:2460: LLOG recovery complete 5250634d-bf04-c271-ed11-19f6899b3131:Processed 0 entries, Recovered 1324501 entries, Took 1029558 ms

2016-04-01T12:20:51.004Z cpu0:33146)WARNING: LSOM: LSOMAddDiskGroupDispatch:7684: Failed to add disk group. SSD 5250634d-bf04-c271-ed11-19f6899b3131: Out of memory

2016-04-01T12:20:51.004Z cpu1:33213)PLOG: PLOGVerifyDiskGroupNotifyCompletion:3925: Notify disk group failed for SSD UUID 5250634d-bf04-c271-ed11-19f6899b3131 :Out of memory was recovery complete ? No

2016-04-01T12:20:51.004Z cpu0:33033)WARNING: PLOG: PLOGCheckRecoveryStatusForOneDevice:6682: Recovery failed for disk 5250634d-bf04-c271-ed11-19f6899b3131

2016-04-01T12:20:51.004Z cpu0:33033)VSAN: Initialization for SSD: 5250634d-bf04-c271-ed11-19f6899b3131 Failed

Can anyone advise further? I think there is something horribly wrong with the SSD, causing a memory leak, although it could just be that recovering the SSD log blocks is exhausting the memory. I'm not sure what this means, though.

I would like to return the SSD & disk group to a manageable state without deleting it. Can someone advise what the above means and help with a possible fix?

Thanks

dmetcalfe92
Enthusiast

Update: I found the thread below, which looks exactly like the issue I'm experiencing now (ALMOST TWO YEARS LATER!)

SERIOUS BUG with VSAN ? :: Potential DATALOSS - Willing to PAY $$$$ for freelancer to assist on resc...

Bad news is, this didn't end well at all.

Apparently there's a private command, provided by VMware engineers, that's supposed to increase the heap memory size. He writes:

Unfortunately, even after using a private command (provided by engineers of VSAN) that was supposed to persistently increase heap memory size of VMkernel, in order to give a chance for the rebuild / replay of SSD log, we had no luck with it and even with several reboots, the last node did not bring back the disk group...

Would anyone know what this command is? I would like to give it a go.

Thanks

dmetcalfe92
Enthusiast

Update:

Managed to find the LSOM heap memory size setting, and the "secret" command to increase it.

For my own sanity and to keep track of what I've done, here's what I did.

You can see where I had already changed the values from their defaults.

LSOM.png

Edit:

The command that VMware wanted to keep so secret is in their archives here:

http://vmware1140.rssing.com/chan-4263175/all_p2889.html

I don't understand why VMware wants to keep this command from the community. I'm currently testing with both heap sizes at their maximum config. My host is booting now.
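For my own record, this is roughly the check-and-set sequence I ran (the set command is the one quoted below from the linked thread; I'm assuming vsish get works on this node the same way it does for other config options):

vsish -e get /config/LSOM/intOpts/heapSize

vsish -e set /config/LSOM/intOpts/heapSize 2047

reboot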

He originally typed:

Dear Simon,

I have tried:

vsish -e set /config/LSOM/intOpts/heapSize 2047

to increase the heapsize of memory that LSOM has allocated, then did reboot, with no success on 192.168.240.55, it still fails to bring up the Disk Group with memory error ...

And changed it to:

Dear Simon,

I have tried the command you gave me in the SR to increase the heapsize of memory that LSOM has allocated, then did reboot, with no success on 192.168.240.55, it still fails to bring up the Disk Group with memory error ...

dmetcalfe92
Enthusiast

The previous post I made (marked as correct) fixed my issue.

I would say this is probably a VMware bug that needs to be corrected.

It's very easily worked around (if the same error occurs, increase the heap size, reboot the host, then drop the size back down again).

My next steps:
-Allow my data to resync over the weekend

-Reboot & ensure everything comes back okay

-Drop the memory sizes back to default (rough commands below)

-Reboot again
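A rough sketch of how I plan to drop the heap size back afterwards (assuming vsish shows the default value alongside the current one for this node; <default> is whatever value it reports):

vsish -e get /config/LSOM/intOpts/heapSize

vsish -e set /config/LSOM/intOpts/heapSize <default>

reboot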
