VMware Cloud Community
vmsysadmin20111
Enthusiast

VSAN issues after power failure

Hi all,

We had a power failure in the lab. After the power was restored, I'm having a number of issues with the VSAN cluster (everything is at the latest patch level on ESXi 6.7 U1).

To be honest, this does not instill much confidence in VSAN's ability to withstand power outages in the data center... I was under the impression that VSAN is supposed to be resilient, but apparently a power outage can lead to VSAN data corruption?

Why am I seeing continuous checksum errors? How do I correlate the component ID from the vmkernel log to a VM?

In RVC, "vsan.cmmds_find "HA Cluster" -t LSOM_OBJECT" shows that all objects are healthy, but at least one VM is not able to boot or migrate to another storage.

How to fix a checksum error on the vsan.perf.stats_object?

Is it possible to recover or should I just move everything off and rebuild from scratch?

Any insights on the errors below? Thanks in advance!

1. On the first host, continuous log spam in vmkernel.log about the checksum error:

2019-02-14T14:53:32.317Z cpu1:2098661)LSOM: LSOMReadVerifyChecksum:3700: Throttled: Checksum error detected on component e930655c-045a-ac0c-892a-e89a8f1419a4, comp offset 25192038400 (computed CRC 0xd889192f != saved CRC 0x78138eac (faked: Y))

2019-02-14T14:53:42.429Z cpu2:2098661)LSOM: LSOMReadVerifyChecksum:3700: Throttled: Checksum error detected on component e930655c-045a-ac0c-892a-e89a8f1419a4, comp offset 25192038400 (computed CRC 0xd889192f != saved CRC 0x78138eac (faked: Y))

2. On the second host continuous log spam:

2019-02-14T14:55:23.645Z cpu12:2099088)HBX: 6134: '7db4445c-b4b2-a83f-e3ab-e89a8f141626': HB at offset 3977728 - Marking HB:

2019-02-14T14:55:23.645Z cpu12:2099088)  [HB state abcdef04 offset 3977728 gen 1 stampUS 18511797829 uuid 5c46e711-8fc28d84-e985-e89a8f141626 jrnl <FB 384408> drv 14.81 lockImpl 4 ip 192.168.0.32]

2019-02-14T14:55:23.647Z cpu12:2099088)HBX: 6219: '7db4445c-b4b2-a83f-e3ab-e89a8f141626': HB at offset 3977728 - Marked HB:

2019-02-14T14:55:23.647Z cpu12:2099088)  [HB state abcdef04 offset 3977728 gen 1 stampUS 18516894357 uuid 5c46e711-8fc28d84-e985-e89a8f141626 jrnl <FB 384408> drv 14.81 lockImpl 4 ip 192.168.0.32]

2019-02-14T14:55:23.647Z cpu12:2099088)FS3J: 4239: Replaying journal at <type 1 addr 384408>, gen 1

2019-02-14T14:55:23.666Z cpu1:2098656)LSOM: LSOMReadVerifyChecksum:3700: Throttled: Checksum error detected on component 93b4445c-e873-9117-75a9-e89a8f141626, comp offset 25192038400 (computed CRC 0xd889192f != saved CRC 0x73696c5f (faked: N))

2019-02-14T14:55:23.666Z cpu12:2099088)WARNING: HBX: 5440: Replay of journal <type 1 addr 384408> on vol '7db4445c-b4b2-a83f-e3ab-e89a8f141626' failed: Checksum mismatch

3. "VSAN Health Alarm 'Stats Master Election'" is triggered. I'm unable to disable the VSAN Performance Service ("General VSAN error") from the GUI; also tried rvc without any success.

/localhost/Datacenter/computers> vsan.perf.stats_object_info "HA Cluster"

Directory Name: .vsan.stats

vSAN Object UUID: 7db4445c-b4b2-a83f-e3ab-e89a8f141626

SPBM Profile: vSAN Storage Policy - RAID 1

vSAN Policy: proportionalCapacity: 0, forceProvisioning: 0, hostFailuresToTolerate: 1, iopsLimit: 0, spbmProfileGenerationNumber: 0, checksumDisabled: 0, stripeWidth: 2, cacheReservation: 0, replicaPreference: Performance, subFailuresToTolerate: 0, spbmProfileId: 7d68c4d1-d3f1-46e2-ad03-282dc97384d3

vSAN Object Health: healthy

2019-02-14 14:52:19 +0000: Fetching vSAN disk info from esx01 (may take a moment) ...

2019-02-14 14:52:19 +0000: Fetching vSAN disk info from 192.168.0.30 (may take a moment) ...

2019-02-14 14:52:19 +0000: Fetching vSAN disk info from esx02 (may take a moment) ...

2019-02-14 14:52:21 +0000: Done fetching vSAN disk infos

DOM Object: 7db4445c-b4b2-a83f-e3ab-e89a8f141626 (v7, owner: esx02, proxy owner: None, policy: subFailuresToTolerate = 0, stripeWidth = 2, cacheReservation = 0, locality = None, SCSN = 58, spbmProfileName = vSAN Storage Policy - RAID 1, CSN = 52, proportionalCapacity = 0, checksumDisabled = 0, replicaPreference = Performance, spbmProfileGenerationNumber = 0, hostFailuresToTolerate = 1, spbmProfileId = 7d68c4d1-d3f1-46e2-ad03-282dc97384d3, forceProvisioning = 0, iopsLimit = 0)

  RAID_1

    RAID_0

      Component: 93b4445c-a4ee-8d17-92d2-e89a8f141626 (state: ACTIVE (5), host: esx02, capacity: naa.5000cca224d591b5, cache: naa.5001b444a4568095,

                                                       votes: 2, usage: 0.3 GB, proxy component: false)

      Component: 93b4445c-e873-9117-75a9-e89a8f141626 (state: ACTIVE (5), host: esx02, capacity: naa.50014ee25c708d1e, cache: naa.5001b444a4568095,

                                                       votes: 1, usage: 0.3 GB, proxy component: false)

    RAID_0

      Component: 8f30475c-58f8-29df-d8a5-e89a8f141626 (state: ACTIVE (5), host: esx01, capacity: naa.50014ee2b205fee8, cache: naa.5001b444a45674c7,

                                                       votes: 1, usage: 0.3 GB, proxy component: false)

      Component: e930655c-045a-ac0c-892a-e89a8f1419a4 (state: ACTIVE (5), host: esx01, capacity: naa.5000cca224d5b6d1, cache: naa.5001b444a45674c7,

                                                       votes: 1, usage: 0.3 GB, proxy component: false)

  Witness: 8f30475c-14bb-2fdf-d4a4-e89a8f141626 (state: ACTIVE (5), host: 192.168.0.30, capacity: mpx.vmhba1:C0:T1:L0, cache: mpx.vmhba1:C0:T2:L0,

                                                 votes: 2, usage: 0.0 GB, proxy component: false)

  Extended attributes:

    Address space: 273804165120B (255.00 GB)

    Object class: vmnamespace

    Object path: /vmfs/volumes/vsan:52a67f471548ec33-2da28b6e2a8cdf54/.vsan.stats

    Object capabilities: NONE

/localhost/Datacenter/computers> vsan.perf.stats_object_delete "HA Cluster"

Deleting vSAN Stats DB object, which will stop vSAN Performance Service ...

Task: Disable vSAN performance service

New progress: 1%

Task result: error

TheBobkin
Champion

Hello vmsysadmin201110141

"had a power failure in the lab. After the power was restored, I'm having a number of issues with the VSAN cluster

Are you sure you were having no issues before the outage that went unnoticed? Was all data healthy and compliant with its Storage Policies? Was there an ongoing resync (or reactive rebalance)? How did you shut down the cluster, or did everything go down cold (and if so, all nodes at once)? How did you bring the cluster back up? (e.g. all at once, any hosts in MM and/or any with a non-0 vSAN DecomState)

"VSAN ability to withstand the power outages in the data center"

Is all your hardware on the vSAN HCL and correctly configured? (e.g. nothing that could cause read/write inconsistencies, like the controller or disk configuration). Note that clusters in datacenters (or any decent set-up) typically have at least a few minutes' worth of juice in case of an outage.

"Why am I seeing continuous checksum errors? How to correlate the the component ID from the vmkernel log to the VM?"

Are you seeing corrected checksum errors or uncorrectable errors? (guessing the latter if Objects are unreadable/unwritable)

Are you seeing any read errors on disk on any hosts? (#grep -E '0x3 |0x4 ' /var/log/vmkernel.log - these match SCSI sense keys 0x3 Medium Error and 0x4 Hardware Error)

Components are reported in /var/log/vobd.log - easiest is to do a grep, then awk '{print $X}', sort, uniq to get the list of component UUIDs. There are a multitude of ways to identify the Objects they belong to, but as it is a 6.7 cluster just use #esxcli vsan debug object list, write it to a file (or pipe to less) and find them there - you can also see whether, for instance, all reported components are on the same disk.
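A rough sketch of that correlation (the awk field number here is a guess based on the vobd line format in this thread - check a sample line on your hosts first and adjust):

# grep 'vob.vsan.dom.unrecoverableerror' /var/log/vobd.log | awk '{print $15}' | sort | uniq
# esxcli vsan debug object list > /tmp/objects.txt

Then search /tmp/objects.txt for each component UUID; the surrounding "Object UUID" and "Path" lines tell you which Object (and thus which VM) it belongs to.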

The Stats Object can be recreated (you could try objtool deletion first, but be 100% sure of the identity of any Object before using this method).
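If you do go the objtool route, a sketch (path and options as I recall them on 6.7 - verify on your build, and triple-check the UUID, as this deletes the Object outright):

# /usr/lib/vmware/osfs/bin/objtool delete -u 7db4445c-b4b2-a83f-e3ab-e89a8f141626 -f -v 10

using the UUID of the .vsan.stats Object from your stats_object_info output above. Afterwards the Stats Object can be recreated with vsan.perf.stats_object_create in RVC (or by re-enabling the Performance Service in the UI).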

Bob

vmsysadmin20111
Enthusiast

Hi Bob,

thank you for your reply! The hardware is not on the HCL since it's a lab. It is disappointing to hear that VSAN is unable to handle power loss without a battery-backed RAID controller. These hosts were running HP StoreVirtual VSA prior to the migration to VSAN and never had any data corruption through many power failures. The VSAN state was healthy prior to the power failure.

I was able to delete the VSAN performance data object and then re-enable the Performance Service; thanks for the info about objtool and the "esxcli vsan debug object list" command. This fixed the issue with the performance service alarm.

To answer your questions:

- there are no read errors on disk on any hosts, vmkernel.log is clean

- all objects are reported as healthy (even the performance data object was healthy)

I'm not able to recover the VM that had the data corruption issue; storage migration fails halfway through. From the vobd.log:

2019-02-17T08:12:19.930Z: [vSANCorrelator] 253531632381us: [vob.vsan.dom.unrecoverableerror] vSAN detected an unrecoverable medium or checksum error for component 1a7e445c-bcda-62bc-2517-e89a8f1419a4 on disk group 523baebc-346d-4dd8-26a3-1485e89fd16f.
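If I need to map that disk group UUID back to physical disks, something like this should do it (guessing at the grep window from the usual esxcli vsan storage list layout):

# esxcli vsan storage list | grep -B 12 '523baebc-346d-4dd8-26a3-1485e89fd16f'

since each device block in that output carries its VSAN Disk Group UUID.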

The object is reported as healthy:

Object UUID: 1a7e445c-70bd-8dbb-54c5-e89a8f1419a4

   Version: 7

   Health: healthy

   Owner: esx02

   Size: 28.00 GB

   Used: 17.70 GB

   Policy:

      checksumDisabled: 0

      stripeWidth: 2

      iopsLimit: 0

      spbmProfileGenerationNumber: 3

      proportionalCapacity: 100

      SCSN: 16

      forceProvisioning: 0

      spbmProfileId: aa6d5a82-1c88-45da-85d3-3d74b91a5bad

      hostFailuresToTolerate: 0

      spbmProfileName: vSAN Default Storage Policy

      cacheReservation: 0

      CSN: 16

   Configuration:

      RAID_0

         Component: 1a7e445c-bcda-62bc-2517-e89a8f1419a4

           Component State: ACTIVE,  Address Space(B): 15032385536 (14.00GB),  Disk UUID: 5226af95-2bd1-94b4-0654-cb90cf1549b2,  Disk Name: naa.50014ee25c708d1e:2

           Votes: 2,  Capacity Used(B): 15342764032 (14.29GB),  Physical Capacity Used(B): 9504292864 (8.85GB),  Host Name: esx02

         Component: 1a7e445c-50e0-64bc-65a7-e89a8f1419a4

           Component State: ACTIVE,  Address Space(B): 15032385536 (14.00GB),  Disk UUID: 52ef03f8-b77e-e338-5983-515d6c8a48c3,  Disk Name: naa.5000cca224d591b5:2

           Votes: 1,  Capacity Used(B): 15342764032 (14.29GB),  Physical Capacity Used(B): 9504292864 (8.85GB),  Host Name: esx02

   Type: vdisk

   Path: /vmfs/volumes/vsan:52a67f471548ec33-2da28b6e2a8cdf54/057e445c-50b6-6dfb-3b44-e89a8f141626/NSX-controller-1.vmdk (Exists)

   Group UUID: 057e445c-50b6-6dfb-3b44-e89a8f141626

   Directory Name: N/A

TheBobkin
Champion

Hello vmsysadmin201110141

"The hardware is not on HCL since it's a lab. It is disappointing to hear that VSAN is unable to handle power loss without the battery-backed RAID controller. These hosts were running HP StoreVirtual VSA prior to migration to VSAN, and through many power failures never had any data corruption. The VSAN state was healthy prior to power failure."

Sorry, but this is false - whether the controller is battery-backed or not (e.g. HBA330) is irrelevant here. vSAN is well capable of enduring power outages without issues, as I see frequently in GSS; funnily enough, most of the time customers call us after an outage it is because their switch is still having issues (or rolled-back settings) and the cluster is partitioned/flapping. Comparing clusters running on unsupported hardware to what we actually certify is not valid, as you are adding a veritable ton of uncontrolled variables. I do not have enough fingers to count the issues I have seen in Whitebox/Workstation/HOL labs that I have never once seen in supported environments; this is the reason we test and certify components ourselves and maintain the vSAN HCL.

That other affected Object is FTT=0, so if any component of it is unrecoverably damaged it is essentially gone. If you have two other controllers configured, and thus still have a majority, you can of course redeploy it:

Redeploy an NSX Controller

Bob

JohnNicholsonVM
Enthusiast

For vSAN, in a perfect world you actually don't want a battery-backed RAID controller; you want a fast, dumb pass-through HBA (like the HBA330+).

As for HCL-compliant drives: the problem with cheap consumer drives is that they ACK writes once they hit a DRAM buffer and lack the full set of capacitors to protect upper and lower pages (a power loss in the middle of a defrag can cause loss of cold data).
