Re: vSAN all-flash lab failures

OxfordRich · ‎06-08-2020

Hi vSAN folks. I'm building a lab and I'm aware my hardware choice is not officially supported, but also aware that it's a popular choice even with some VMware engineers.

I'm wondering if anybody out there has a VMware vSAN environment running successfully on NUC10i7FNH hardware, or NUC10s in general? Or does anybody recognise the error?

I've set up a 3 node cluster for a small vSAN all-flash eval lab. Lab hardware overview:

3x NUC10i7FNH3
3x Samsung PM883 960GB 2.5" SATA3 Enterprise SSD/Solid State Drive (Capacity Tier)
3x WD Black 250GB SN750 NVMe SSD (Cache Tier)

I've built the cluster and everything looks great for a bit. Then the hosts start to mark their disk group as failed! The log errors that seem relevant are below.

This occurs at exactly the point when the disk group is marked as Unhealthy, on each host. I know it usually represents a disk error/write failure/faulty media, but in this case all disks have been swapped and this happens on all NUCs. The media itself isn't failing, but for some reason it's returning an I/O error, causing it to be marked as dead:
WARNING: LSOMCommon: IORETRYParentIODoneCB:2219: Throttled: split status I/O error
WARNING: PLOG: PLOGElevWriteMDCb:746: MD UUID 52b7d790-0e5d-a8b2-c290-8db105925979 write failed I/O error

This error is repeated fairly frequently in the logs:
WARNING: NvmeScsi: 149: SCSI opcode 0x1a (0x453a411fe1c0) on path vmhba1:C0:T0:L0 to namespace t10.NVMe____WDS250G3X0C2D00SJG0______________________50E0DE448B441B00 failed with NVMe error status: 0x2
WARNING: translating to SCSI error H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0
WARNING: NvmeScsi: 149: SCSI opcode 0x85 (0x453a40fbc680) on path vmhba1:C0:T0:L0 to namespace t10.NVMe____WDS250G3X0C2D00SJG0______________________50E0DE448B441B00 failed with NVMe error status:
WARNING: translating to SCSI error H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0
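For anyone decoding these status lines by hand, here is a minimal Python sketch. The lookup tables cover only the codes that appear in this thread, with meanings taken from the standard SCSI code assignments. The names `decode_status` and `opcode_name` are made up for illustration and are not part of any VMware tool.

```python
import re

# Only the codes seen in this thread; meanings follow the standard
# SCSI opcode, sense key and ASC/ASCQ assignments.
OPCODES = {
    0x1A: "MODE SENSE(6)",
    0x2A: "WRITE(10)",
    0x85: "ATA PASS-THROUGH(16)",
}
DEVICE_STATUS = {0x0: "GOOD", 0x2: "CHECK CONDITION"}
SENSE_KEYS = {0x4: "HARDWARE ERROR", 0x5: "ILLEGAL REQUEST"}
ASC_ASCQ = {
    (0x20, 0x00): "INVALID COMMAND OPERATION CODE",
    (0x24, 0x00): "INVALID FIELD IN CDB",
    (0x44, 0x00): "INTERNAL TARGET FAILURE",
}

HEX = r"(0x[0-9a-fA-F]+)"
STATUS_RE = re.compile(
    rf"H:{HEX} D:{HEX} P:{HEX} Valid sense data: {HEX} {HEX} {HEX}"
)


def decode_status(line):
    """Decode the 'H: D: P: Valid sense data:' part of a vmkernel log line.

    Returns None when the line carries no such status.
    """
    match = STATUS_RE.search(line)
    if match is None:
        return None
    host, device, plugin, key, asc, ascq = (int(x, 16) for x in match.groups())
    return {
        "host": host,
        "device": DEVICE_STATUS.get(device, hex(device)),
        "plugin": plugin,
        "sense_key": SENSE_KEYS.get(key, hex(key)),
        "additional": ASC_ASCQ.get((asc, ascq), f"{asc:#x}/{ascq:#x}"),
    }


def opcode_name(code):
    """Name a SCSI opcode, falling back to the raw hex value."""
    return OPCODES.get(code, f"unknown opcode {code:#x}")
```

Both NVMe warnings decode to CHECK CONDITION with sense key ILLEGAL REQUEST, for MODE SENSE(6) and ATA PASS-THROUGH(16) respectively. That reads more like the drive rejecting commands it does not implement than like a media fault, though this is an interpretation and not something the logs state outright.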

Maybe not relevant, but during boot I can see:
nvme_pcie00580000:NVMEPCIEAdapterInit:446:workaround=0
WARNING: Invalid parameter: vmknvme_client_type -1, set to default value 1.
WARNING: Invalid parameter: vmknvme_io_queue_num 0, set to default value 1.
WARNING: NVMEPSA:2003 Failed to query initiator attributes: Not supported

Here are some things I've tried, in an effort to narrow this down:

This happens with both ESXi 6.7 Update 3, and ESXi 7. (I have built both environments from scratch to test this, no change. Latest updates are applied to both.)
I suspected it was because of some incompatibility with the NVMe disk brand, so I have replaced the NVMe disks I originally bought (Samsung 970 EVO Plus) with WD Black SN750 in all hosts. No change at all.
I tested with a 4th host, same issue.
I upgraded the SSD used for capacity tier to a disk that's on the HCL (Samsung) and the nature of the errors and events logged didn't change, still no joy.
I've tried this on original NUC firmware v37 and the latest v39
I've entirely disabled power management in the BIOS
I've tried UEFI and Legacy boot

I have been looking at this for a week or two now, and it's causing me more grey hair. Does anybody have any ideas, or even better, a NUC10 environment where this works?

TheBobkin · ‎06-09-2020

Hello OxfordRich,

Welcome to Communities.

Can you share more of the vmkernel.log before the Disk-Group is marked as failed?

Do you see any 'UNDERRUN' messages prior to the snippet you noted there?

If so, could be the following issue:

VMware Knowledge Base

Bob

OxfordRich · ‎06-09-2020

Hi, thanks so much for your response. I've not spotted that KB yet, that's a great pointer, thank you.

Unfortunately I have fallen at the first hurdle:

[root@ESXi-1:~] esxcli system module parameters set -m nvme -p io_split=0

Invalid Module name nvme

I'm trying this on ESXi 7.0, maybe the command to achieve this has moved?

I'll go re-check the vmkernel.log and upload a little more, and give that a try and report back.

Rich

TheBobkin · ‎06-09-2020

Hello Rich,

It does indeed look to be gone in 7.0, from checking in a lab here. The same settings may have moved to somewhere under esxcli nvme (which appears to have been expanded), but I have no way of checking where, as I have no labs with NVMe devices. I might aim to spoof some.

Bob

GreatWhiteTec · ‎06-09-2020

The command is still in place in 7.0. It may not be recognizing the module because the module is not loaded.

If you run esxcli system module list - you can see all the modules and which ones are enabled/loaded.

I have a fresh ESXi 7.0 host. The nvme module is enabled but not loaded, so the command to set it does not recognize "nvme" as a module.
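A minimal Python sketch of the same check, for anyone who wants to script it: it parses the text output of `esxcli system module list` and reports modules that are enabled but not loaded. It assumes the output has three columns (Name, Is Loaded, Is Enabled) with `true`/`false` values. The sample text is illustrative and was not captured from a real host, and the function names are hypothetical.

```python
def parse_module_list(text):
    """Return {module_name: (is_loaded, is_enabled)} from the table output."""
    modules = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows have exactly three fields ending in two booleans.
        # This skips the header, the dashed separator and blank lines.
        if (
            len(parts) == 3
            and parts[1] in ("true", "false")
            and parts[2] in ("true", "false")
        ):
            modules[parts[0]] = (parts[1] == "true", parts[2] == "true")
    return modules


def enabled_but_not_loaded(modules):
    """List modules that are enabled in config but not currently loaded."""
    return sorted(
        name for name, (loaded, enabled) in modules.items() if enabled and not loaded
    )


# Illustrative sample only (assumed layout), not real host output.
SAMPLE = """\
Name      Is Loaded  Is Enabled
--------  ---------  ----------
vmkernel       true        true
vmw_ahci       true        true
nvme          false        true
"""

if __name__ == "__main__":
    print(enabled_but_not_loaded(parse_module_list(SAMPLE)))
```

On a host, the text could come from running the command and capturing its output. A module in this list would explain an "Invalid Module name" style error when trying to set its parameters.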

I'll try to get it to load and report back. I do have all-NVMe on my hosts

GreatWhiteTec · ‎06-09-2020

Did a reboot... nvme module gone. new nvme_pcie appeared. Enabled and loaded.

OxfordRich · ‎06-09-2020

Hi guys, thanks for taking a look at this. I do have the nvme_pcie module loaded, but it doesn't support the same parameters, so I can't try that command via this method.

I would be prepared to build the environment under 6.7 to give this a bash if needed. Here's what I see in the nvme_pcie parameter list. I have had a reasonable dig around and can't see anything related to split I/O at the minute.

[root@ESXi-1:~] esxcli system module parameters list -m nvme_pcie
Name               Type  Value  Description
-----------------  ----  -----  -----------
nvmePCIEDebugMask  int          NVMe PCIe driver debug mask
nvmePCIELogLevel   int          NVMe PCIe driver log level

TheBobkin · ‎06-09-2020

Hello Rich,

Before going down a potential rabbit-hole here, can you either share/PM the vmkernel.log and/or confirm that it is stating the UNDERRUN condition I mentioned?

Bob

OxfordRich · ‎06-09-2020

Hi Bob,

Good point - no, the logs don't show UNDERRUN
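One way to make that check repeatable is sketched below in Python. It collects any lines containing UNDERRUN that appear before the first "Permanent error event" line, which is the PLOG warning logged when vSAN fails the device. The function name and its parameters are hypothetical, and the log path in the comment is the usual ESXi location, so adjust it if yours differs.

```python
def underrun_before_failure(
    lines, failure_marker="Permanent error event", needle="UNDERRUN"
):
    """Return lines containing `needle` that precede the first failure line.

    If `failure_marker` never appears, the whole input is searched.
    """
    hits = []
    for line in lines:
        if failure_marker in line:
            break  # stop at the point the disk is marked failed
        if needle in line:
            hits.append(line.rstrip("\n"))
    return hits


# Example use on a host (assumed path):
# with open("/var/log/vmkernel.log", errors="replace") as log:
#     for hit in underrun_before_failure(log):
#         print(hit)
```

An empty result means no UNDERRUN was logged before the failure in that file. Rotated logs would need checking separately.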

I have restarted one of the hosts and run HCIbench to give the vSAN some work, rather than leaving it idling overnight (which would also fail). Attached is the vmkernel.log, which contains the following highlights towards the end, when the disks were marked as failed:

AHCI is really unhappy..

2020-06-09T20:49:48.922Z cpu0:526070)osfs: OSFS_GetMountPointList:3696: mountPoints[0] inUse pid [ vsan], cid 527d400a075db425-2a34056e0cf09036

2020-06-09T20:50:00.247Z cpu10:530622)osfs: OSFS_GetMountPointList:3696: mountPoints[0] inUse pid [ vsan], cid 527d400a075db425-2a34056e0cf09036

2020-06-09T20:50:18.751Z cpu10:526031)vmw_ahci[00000017]: CompletionBottomHalf:Error port=2, PxIS=0x08000000, PxTDF=0xc0,PxSERR=0x00400100, PxCI=0x000001c0, PxSACT=0x000001f8, ActiveTags=0x000001f8

2020-06-09T20:50:18.751Z cpu10:526031)vmw_ahci[00000017]: CompletionBottomHalf:SCSI cmd 0x2a on slot 6 lba=0x895280, lbc=0x80

2020-06-09T20:50:18.751Z cpu10:526031)vmw_ahci[00000017]: CompletionBottomHalf:cfis->command= 0x61

2020-06-09T20:50:18.751Z cpu10:526031)vmw_ahci[00000017]: LogExceptionSignal:Port 2, Signal: --|--|--|--|--|TF|--|--|--|--|--|-- (0x0020) Curr: --|--|--|--|--|--|--|--|--|--|--|-- (0x0000)

2020-06-09T20:50:18.751Z cpu8:524909)vmw_ahci[00000017]: LogExceptionProcess:Port 2, Process: --|--|--|--|--|TF|--|--|--|--|--|-- (0x0020) Curr: --|--|--|--|--|TF|--|--|--|--|--|-- (0x0020)

2020-06-09T20:50:18.751Z cpu8:524909)vmw_ahci[00000017]: ExceptionHandlerWorld:Performing device reset due to Task File Error.

2020-06-09T20:50:18.751Z cpu8:524909)vmw_ahci[00000017]: ExceptionHandlerWorld:hardware stop on slot 0x6, activeTags 0x000001f8

2020-06-09T20:50:18.752Z cpu8:524909)vmw_ahci[00000017]: _IssueComReset:Issuing comreset...

2020-06-09T20:50:18.827Z cpu8:524909)vmw_ahci[00000017]: _IssueComReset:Issuing comreset...

2020-06-09T20:50:18.835Z cpu8:524909)vmw_ahci[00000017]: ExceptionHandlerWorld:fail a command on slot 4

2020-06-09T20:50:18.835Z cpu8:524909)vmw_ahci[00000017]: IssueCommand:tag: 3 already active during issue, reissue_flag:1

2020-06-09T20:50:18.835Z cpu7:525052)NMP: nmp_ThrottleLogForDevice:3856: Cmd 0x2a (0x453ffb9d6580, 0) to dev "t10.ATA_____Samsung_SSD_860_QVO_1TB_________________S4CZNF0N368707N_____" on path "vmhba0:C0:T2:L0" Failed:

2020-06-09T20:50:18.835Z cpu8:524909)vmw_ahci[00000017]: IssueCommand:tag: 5 already active during issue, reissue_flag:1

2020-06-09T20:50:18.835Z cpu7:525052)NMP: nmp_ThrottleLogForDevice:3865: H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x44 0x0. Act:NONE. cmdId.initiator=0x430366aae5c0 CmdSN 0xcd95

2020-06-09T20:50:18.835Z cpu8:524909)vmw_ahci[00000017]: IssueCommand:tag: 6 already active during issue, reissue_flag:1

2020-06-09T20:50:18.835Z cpu7:525052)ScsiDeviceIO: 4062: Cmd(0x453ffb9d6580) 0x2a, CmdSN 0xcd95 from world 0 to dev "t10.ATA_____Samsung_SSD_860_QVO_1TB_________________S4CZNF0N368707N_____" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x44 0x0.

2020-06-09T20:50:18.835Z cpu8:524909)vmw_ahci[00000017]: IssueCommand:tag: 7 already active during issue, reissue_flag:1

2020-06-09T20:50:18.835Z cpu8:524909)vmw_ahci[00000017]: IssueCommand:tag: 8 already active during issue, reissue_flag:1

2020-06-09T20:50:18.835Z cpu8:524909)vmw_ahci[00000017]: ProcessActiveCommands:Commands completed: 0, re-issued: 5

2020-06-09T20:50:18.835Z cpu7:525052)WARNING: PLOG: PLOGPropagateErrorInt:3955: Permanent error event on 52dc7fd0-9c95-5521-eb69-9b82b448d9a2

2020-06-09T20:50:18.835Z cpu11:526001)LSOM: LSOMLogDiskEvent:7628: Disk Event permanent error for MD 52dc7fd0-9c95-5521-eb69-9b82b448d9a2 (t10.ATA_____Samsung_SSD_860_QVO_1TB_________________S4CZNF0N368707N_____:2)

2020-06-09T20:50:18.835Z cpu11:526001)WARNING: LSOM: LSOMEventNotify:7877: vSAN device 52dc7fd0-9c95-5521-eb69-9b82b448d9a2 is under permanent error.

2020-06-09T20:50:18.835Z cpu7:525052)LSOMCommon: IORETRYCompleteIO:483: Throttled: 0x4540016e7940 IO type 304 (WRITE) isOrdered:NO isSplit:YES isEncr:NO since 85 msec status I/O error

2020-06-09T20:50:18.836Z cpu7:525052)WARNING: LSOMCommon: IORETRYParentIODoneCB:2219: Throttled: split status I/O error

2020-06-09T20:50:18.836Z cpu7:525052)WARNING: PLOG: PLOGElevWriteMDCb:746: MD UUID 52dc7fd0-9c95-5521-eb69-9b82b448d9a2 write failed I/O error

2020-06-09T20:50:19.836Z cpu4:525951)PLOG: PLOGElevHandleDeviceError:1024: Elevator for t10.ATA_____Samsung_SSD_860_QVO_1TB_________________S4CZNF0N368707N_____:2 UUID 52dc7fd0-9c95-5521-eb69-9b82b448d9a2 moving to cleanup state

2020-06-09T20:50:19.922Z cpu11:525951)PLOG: PLOGElevTaskComplete:3442: PLOG Elevator exited
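The PxCI, PxSACT and ActiveTags values in the vmw_ahci error line are bitmasks with one bit per command slot. A minimal Python sketch for turning them into slot numbers follows. The helper name `slots` is made up, and the failed slot number is taken from the "fail a command on slot 4" line above.

```python
def slots(mask):
    """Return the command slot numbers whose bits are set in a 32-bit mask."""
    return [bit for bit in range(32) if (mask >> bit) & 1]


# Values copied from the CompletionBottomHalf error line.
px_sact = 0x000001F8
px_ci = 0x000001C0
active_tags = 0x000001F8

# Slot the driver reports failing after the reset.
failed_slot = 4

outstanding = slots(active_tags)
reissued = [s for s in outstanding if s != failed_slot]

if __name__ == "__main__":
    print("PxSACT slots:    ", slots(px_sact))
    print("PxCI slots:      ", slots(px_ci))
    print("ActiveTags slots:", outstanding)
    print("Re-issued slots: ", reissued)
```

PxSACT and ActiveTags both decode to slots 3 to 8, so six commands were outstanding when the error hit. The log then fails the command on slot 4 and reports tags 3, 5, 6, 7 and 8 as already active, which lines up with "re-issued: 5".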

TheBobkin · ‎06-10-2020

Hello Rich,

So this is giving a fairly generic hardware failure: