VMware Cloud Community
OxfordRich
Contributor

vSAN all-flash lab failures

Hi vSAN folks. I'm building a lab and I'm aware my hardware choice is not officially supported, but I'm also aware that it's a popular choice, even among some VMware engineers.

I'm wondering if anybody out there has a VMware vSAN environment running successfully on NUC10i7FNH hardware, or NUC10s in general? Or does anybody recognise the error?

I've set up a 3 node cluster for a small vSAN all-flash eval lab. Lab hardware overview:

  • 3x NUC10i7FNH3
  • 3x Samsung PM883 960GB 2.5" SATA3 Enterprise SSD/Solid State Drive (Capacity Tier)
  • 3x WD Black 250GB SN750 NVMe SSD (Cache Tier)

I've built the cluster and everything looks great for a bit. Then the hosts start to mark their disk group as failed! The errors in the logs that seem relevant:

  • This occurs at exactly the point when the disk group is marked as Unhealthy, on each host. I know it usually represents a disk error/write failure/faulty media, but in this case all disks have been swapped and this happens on all NUCs. The media itself isn't failing, but for some reason it's returning an I/O error, causing it to be marked as dead:
    WARNING: LSOMCommon: IORETRYParentIODoneCB:2219: Throttled: split status I/O error
    WARNING: PLOG: PLOGElevWriteMDCb:746: MD UUID 52b7d790-0e5d-a8b2-c290-8db105925979 write failed I/O error

  • This error is repeated fairly frequently in the logs:
    WARNING: NvmeScsi: 149: SCSI opcode 0x1a (0x453a411fe1c0) on path vmhba1:C0:T0:L0 to namespace t10.NVMe____WDS250G3X0C2D00SJG0______________________50E0DE448B441B00 failed with NVMe error status: 0x2
    WARNING: translating to SCSI error H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0
    WARNING: NvmeScsi: 149: SCSI opcode 0x85 (0x453a40fbc680) on path vmhba1:C0:T0:L0 to namespace t10.NVMe____WDS250G3X0C2D00SJG0______________________50E0DE448B441B00 failed with NVMe error status:
    WARNING: translating to SCSI error H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0

  • Maybe not relevant, but during boot I can see:
    nvme_pcie00580000:NVMEPCIEAdapterInit:446:workaround=0
    WARNING: Invalid parameter: vmknvme_client_type -1, set to default value 1.
    WARNING: Invalid parameter: vmknvme_io_queue_num 0, set to default value 1.
    WARNING: NVMEPSA:2003 Failed to query initiator attributes: Not supported

Here are some things I've tried, in an effort to narrow this down:

  • This happens with both ESXi 6.7 Update 3, and ESXi 7. (I have built both environments from scratch to test this, no change. Latest updates are applied to both.)
  • I suspected some incompatibility with the NVMe disk brand, so I replaced the NVMe disks I originally bought (Samsung 970 EVO Plus) with WD Black SN750 in all hosts. No change at all.
  • I tested with a 4th host; same issue.
  • I upgraded the SSD used for the capacity tier to a disk that's on the HCL (Samsung), and the nature of the errors and events logged didn't change; still no joy.
  • I've tried this on the original NUC firmware v37 and the latest v39.
  • I've entirely disabled power management in the BIOS.
  • I've tried both UEFI and Legacy boot.

I have been looking at this for a week or two now, and it's causing me more grey hair. Does anybody have any ideas, or even better, a NUC10 environment where this works?
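
For what it's worth, this is roughly how I've been checking the device and vSAN disk state from the host shell (the device identifier below is just a placeholder for whatever esxcli lists on your host):

[root@ESXi-1:~] esxcli vsan storage list                              # per-disk vSAN state / disk group membership
[root@ESXi-1:~] esxcli storage core device list                       # device identifiers (t10.ATA..., t10.NVMe...)
[root@ESXi-1:~] esxcli storage core device smart get -d <device_id>   # SMART/health counters for a given device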

TheBobkin
Champion

Hello OxfordRich,

Welcome to Communities.

Can you share more of the vmkernel.log before the Disk-Group is marked as failed?

Do you see any 'UNDERRUN' messages prior to the snippet you noted there?

If so, could be the following issue:

VMware Knowledge Base

Bob

OxfordRich
Contributor

Hi, thanks so much for your response. I hadn't spotted that KB; that's a great pointer, thank you.

Unfortunately I have fallen at the first hurdle:

[root@ESXi-1:~] esxcli system module parameters set -m nvme -p io_split=0

Invalid Module name nvme

I'm trying this on ESXi 7.0; maybe the command to achieve this has moved?

I'll go re-check the vmkernel.log, upload a little more, give that a try, and report back.

Rich

TheBobkin
Champion

Hello Rich,

It does indeed look to be gone in 7.0 from checking in the lab here - the same setting may have been moved somewhere under esxcli nvme (which appears to have been expanded), but I have no way of checking where, as I have no labs with NVMe devices - I guess I might aim to spoof some.

Bob

GreatWhiteTec
VMware Employee

The command is still in place in 7.0. The reason it isn't recognizing the module may be that the module isn't loaded.

If you run esxcli system module list, you can see all the modules and which ones are enabled/loaded.

I have a fresh ESXi 7.0 host. The nvme module is enabled but not loaded, so the command to set it does not recognize "nvme" as a module.

I'll try to get it to load and report back. I do have all-NVMe on my hosts.
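
For reference, this is roughly what I'm checking, assuming the module is still called "nvme" as it was in 6.7:

esxcli system module list | grep -i nvme          # is the module enabled/loaded?
esxcli system module set -m nvme --enabled=true   # enable it, if it exists under this name
esxcli system module load -m nvme                 # try loading it without a reboot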

GreatWhiteTec
VMware Employee

Did a reboot... the nvme module is gone, and a new nvme_pcie module has appeared. Enabled and loaded.

OxfordRich
Contributor

Hi guys, thanks for taking a look at this. I do have the nvme_pcie module loaded, but it doesn't support the same parameters, so I can't try that command via this method.

I would be prepared to build the environment under 6.7 to give this a bash if needed. Here's what I see via nvme_pcie's parameter list. I've had a reasonable dig around and can't see anything related to split I/O at the minute.

[root@ESXi-1:~] esxcli system module parameters list -m nvme_pcie
Name               Type  Value  Description
-----------------  ----  -----  -----------
nvmePCIEDebugMask  int          NVMe PCIe driver debug mask
nvmePCIELogLevel   int          NVMe PCIe driver log level

TheBobkin
Champion

Hello Rich,

Before going down a potential rabbit-hole here, can you either share/PM the vmkernel.log and/or confirm that it is stating the UNDERRUN condition I mentioned?
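
A quick check from the host shell should be enough, assuming the usual log locations:

grep -i underrun /var/log/vmkernel.log                 # live log
zcat /var/run/log/vmkernel.*.gz | grep -i underrun     # rotated logs, if any are present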

Bob

OxfordRich
Contributor

Hi Bob,

Good point - no, the logs don't show UNDERRUN.

I have restarted one of the hosts and run HCIbench to give the vSAN some work, rather than leaving it idling overnight (which would also fail). Attached is the vmkernel.log, which contains the following highlights towards the end, when the disks were failed:

AHCI is really unhappy...

2020-06-09T20:49:48.922Z cpu0:526070)osfs: OSFS_GetMountPointList:3696: mountPoints[0] inUse pid [    vsan], cid 527d400a075db425-2a34056e0cf09036

2020-06-09T20:49:48.922Z cpu0:526070)osfs: OSFS_GetMountPointList:3696: mountPoints[0] inUse pid [    vsan], cid 527d400a075db425-2a34056e0cf09036

2020-06-09T20:50:00.247Z cpu10:530622)osfs: OSFS_GetMountPointList:3696: mountPoints[0] inUse pid [    vsan], cid 527d400a075db425-2a34056e0cf09036

2020-06-09T20:50:18.751Z cpu10:526031)vmw_ahci[00000017]: CompletionBottomHalf:Error port=2, PxIS=0x08000000, PxTDF=0xc0,PxSERR=0x00400100, PxCI=0x000001c0, PxSACT=0x000001f8, ActiveTags=0x000001f8

2020-06-09T20:50:18.751Z cpu10:526031)vmw_ahci[00000017]: CompletionBottomHalf:SCSI cmd 0x2a on slot 6 lba=0x895280, lbc=0x80

2020-06-09T20:50:18.751Z cpu10:526031)vmw_ahci[00000017]: CompletionBottomHalf:cfis->command= 0x61

2020-06-09T20:50:18.751Z cpu10:526031)vmw_ahci[00000017]: LogExceptionSignal:Port 2, Signal:  --|--|--|--|--|TF|--|--|--|--|--|-- (0x0020) Curr: --|--|--|--|--|--|--|--|--|--|--|-- (0x0000)

2020-06-09T20:50:18.751Z cpu8:524909)vmw_ahci[00000017]: LogExceptionProcess:Port 2, Process: --|--|--|--|--|TF|--|--|--|--|--|-- (0x0020) Curr: --|--|--|--|--|TF|--|--|--|--|--|-- (0x0020)

2020-06-09T20:50:18.751Z cpu8:524909)vmw_ahci[00000017]: ExceptionHandlerWorld:Performing device reset due to Task File Error.

2020-06-09T20:50:18.751Z cpu8:524909)vmw_ahci[00000017]: ExceptionHandlerWorld:hardware stop on slot 0x6, activeTags 0x000001f8

2020-06-09T20:50:18.752Z cpu8:524909)vmw_ahci[00000017]: _IssueComReset:Issuing comreset...

2020-06-09T20:50:18.827Z cpu8:524909)vmw_ahci[00000017]: _IssueComReset:Issuing comreset...

2020-06-09T20:50:18.835Z cpu8:524909)vmw_ahci[00000017]: ExceptionHandlerWorld:fail a command on slot 4

2020-06-09T20:50:18.835Z cpu8:524909)vmw_ahci[00000017]: IssueCommand:tag: 3 already active during issue, reissue_flag:1

2020-06-09T20:50:18.835Z cpu7:525052)NMP: nmp_ThrottleLogForDevice:3856: Cmd 0x2a (0x453ffb9d6580, 0) to dev "t10.ATA_____Samsung_SSD_860_QVO_1TB_________________S4CZNF0N368707N_____" on path "vmhba0:C0:T2:L0" Failed:

2020-06-09T20:50:18.835Z cpu8:524909)vmw_ahci[00000017]: IssueCommand:tag: 5 already active during issue, reissue_flag:1

2020-06-09T20:50:18.835Z cpu7:525052)NMP: nmp_ThrottleLogForDevice:3865: H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x44 0x0. Act:NONE. cmdId.initiator=0x430366aae5c0 CmdSN 0xcd95

2020-06-09T20:50:18.835Z cpu8:524909)vmw_ahci[00000017]: IssueCommand:tag: 6 already active during issue, reissue_flag:1

2020-06-09T20:50:18.835Z cpu7:525052)ScsiDeviceIO: 4062: Cmd(0x453ffb9d6580) 0x2a, CmdSN 0xcd95 from world 0 to dev "t10.ATA_____Samsung_SSD_860_QVO_1TB_________________S4CZNF0N368707N_____" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x44 0x0.

2020-06-09T20:50:18.835Z cpu8:524909)vmw_ahci[00000017]: IssueCommand:tag: 7 already active during issue, reissue_flag:1

2020-06-09T20:50:18.835Z cpu8:524909)vmw_ahci[00000017]: IssueCommand:tag: 8 already active during issue, reissue_flag:1

2020-06-09T20:50:18.835Z cpu8:524909)vmw_ahci[00000017]: ProcessActiveCommands:Commands completed: 0, re-issued: 5

2020-06-09T20:50:18.835Z cpu7:525052)WARNING: PLOG: PLOGPropagateErrorInt:3955: Permanent error event on 52dc7fd0-9c95-5521-eb69-9b82b448d9a2

2020-06-09T20:50:18.835Z cpu11:526001)LSOM: LSOMLogDiskEvent:7628: Disk Event permanent error for MD 52dc7fd0-9c95-5521-eb69-9b82b448d9a2 (t10.ATA_____Samsung_SSD_860_QVO_1TB_________________S4CZNF0N368707N_____:2)

2020-06-09T20:50:18.835Z cpu11:526001)WARNING: LSOM: LSOMEventNotify:7877: vSAN device 52dc7fd0-9c95-5521-eb69-9b82b448d9a2 is under permanent error.

2020-06-09T20:50:18.835Z cpu7:525052)LSOMCommon: IORETRYCompleteIO:483: Throttled:  0x4540016e7940 IO type 304 (WRITE) isOrdered:NO isSplit:YES isEncr:NO since 85 msec status I/O error

2020-06-09T20:50:18.836Z cpu7:525052)WARNING: LSOMCommon: IORETRYParentIODoneCB:2219: Throttled: split status I/O error

2020-06-09T20:50:18.836Z cpu7:525052)WARNING: PLOG: PLOGElevWriteMDCb:746: MD UUID 52dc7fd0-9c95-5521-eb69-9b82b448d9a2 write failed I/O error

2020-06-09T20:50:19.836Z cpu4:525951)PLOG: PLOGElevHandleDeviceError:1024: Elevator for t10.ATA_____Samsung_SSD_860_QVO_1TB_________________S4CZNF0N368707N_____:2 UUID 52dc7fd0-9c95-5521-eb69-9b82b448d9a2  moving to cleanup state

2020-06-09T20:50:19.922Z cpu11:525951)PLOG: PLOGElevTaskComplete:3442: PLOG Elevator exited

TheBobkin
Champion

Hello Rich,

So this is giving a fairly generic hardware failure:

2020-06-09T20:50:18.751Z cpu8:524909)vmw_ahci[00000017]: ExceptionHandlerWorld:Performing device reset due to Task File Error.

2020-06-09T20:50:18.751Z cpu8:524909)vmw_ahci[00000017]: ExceptionHandlerWorld:hardware stop on slot 0x6, activeTags 0x000001f8

2020-06-09T20:50:18.752Z cpu8:524909)vmw_ahci[00000017]: _IssueComReset:Issuing comreset...

2020-06-09T20:50:18.827Z cpu8:524909)vmw_ahci[00000017]: _IssueComReset:Issuing comreset...

2020-06-09T20:50:18.835Z cpu8:524909)vmw_ahci[00000017]: ExceptionHandlerWorld:fail a command on slot 4

2020-06-09T20:50:18.835Z cpu8:524909)vmw_ahci[00000017]: IssueCommand:tag: 3 already active during issue, reissue_flag:1

2020-06-09T20:50:18.835Z cpu7:525052)NMP: nmp_ThrottleLogForDevice:3856: Cmd 0x2a (0x453ffb9d6580, 0) to dev "t10.ATA_____Samsung_SSD_860_QVO_1TB_________________S4CZNF0N368707N_____" on path "vmhba0:C0:T2:L0" Failed:

2020-06-09T20:50:18.835Z cpu8:524909)vmw_ahci[00000017]: IssueCommand:tag: 5 already active during issue, reissue_flag:1

2020-06-09T20:50:18.835Z cpu7:525052)NMP: nmp_ThrottleLogForDevice:3865: H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x44 0x0. Act:NONE. cmdId.initiator=0x430366aae5c0 CmdSN 0xcd95

  • Host Status [0x0] OK: no error on the host side. This is when you will see a status for a Device or Plugin, and "Valid sense data" rather than "Possible sense data".
  • Device Status [0x2] CHECK_CONDITION: the command failed for a specific reason. When a CHECK CONDITION is received, the ESX storage stack sends SCSI command 0x3 (REQUEST SENSE) to get the SCSI sense data (Sense Key, Additional Sense Code, ASC Qualifier, and other bits). The sense data is listed after "Valid sense data" in the order Sense Key, Additional Sense Code, ASC Qualifier.
  • Plugin Status [0x0] GOOD: no error. (ESXi 5.x / 6.x only)
  • Sense Key [0x4] HARDWARE ERROR
  • Additional Sense Data 44/00 INTERNAL TARGET FAILURE

https://www.virten.net/vmware/vmware-esxi-scsi-sense-code-decoder-v2/?scsiCode=H%3A0x0+D%3A0x2+P%3A0...

What is potentially of note is that it appears to be using the vmw_ahci driver family, as opposed to what I would have expected - the native nvme_pcie driver - which I would have assumed would be used when vendor/device-specific ones (e.g. intel-nvme) are not available. If you look at the vSAN/ESXi HCL for Samsung NVMe devices, they use 'nvme' up until 7.0, where they use 'nvme_pcie'.
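
If you want to confirm which driver is claiming each adapter and where each device sits, something along these lines should show it (I can't verify the exact output here without NVMe hardware, so treat it as a sketch; the device ID is a placeholder):

esxcli storage core adapter list                     # vmhba -> driver (e.g. vmw_ahci, nvme_pcie)
esxcli storage core path list -d <device_id>         # which vmhba a given device is behind
esxcli system module list | grep -i -E 'nvme|ahci'   # which of those modules are actually loaded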

Bob

OxfordRich
Contributor

Thanks Bob, some more excellent points!

Any idea if it's possible to change the driver in use from vmw_ahci to nvme_pcie?

Rich

TheBobkin
Champion

Hello Rich,

Did you upgrade this cluster to 6.7 or clean install?

I ask because, if you look on reddit and elsewhere, a load of home-labbers have encountered issues with whitebox NVMe components - initially in later versions of 6.7, having to use the 6.7 GA inbox driver and/or hacky workarounds to get these working, and then other issues in 7.0 due to how they upgraded the cluster, as William Lam covers here:

https://www.virtuallyghetto.com/2020/04/important-nvme-ssd-not-found-after-upgrading-to-esxi-7-0.htm...

Bob

OxfordRich
Contributor

Hi Bob, yes I have read those. This was initially a fresh install of 6.7 U3, and when I faced these exact same issues, I did another fresh build of ESXi 7.0 and vCenter 7.

I actually wouldn't mind trying an NVMe from the HCL; I just can't be certain it's not an issue with a storage component on the NUC board itself, or something in Intel's BIOS. Not really sure where to go with it next.

gglaccum
Contributor

Hi there, are you still having these issues?

I purchased 2 of these NUCs with the Samsung 970 EVO for a home lab. I didn't /notice/ anything wrong, and recently purchased a third with a slightly different NVMe, along with 3x 1TB Samsung SSDs to go with it.

I also combined them all under vCenter and ran the ESXi updates against the hosts, so in theory they are running whatever the latest ESXi 7 is.

Now, I noticed errors on the machines...

The first two were 'loaded' and the third was not, so I migrated stuff off the third and installed a new CentOS 8 VM with iozone (my preferred disk load generator).
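
For reference, something along these lines is what I mean by a disk load (the record/file sizes and the path are just what I happened to use; adjust to taste):

iozone -i 0 -i 1 -r 1m -s 4g -f /mnt/test/iozone.tmp   # sequential write then read, 1 MB records, 4 GB file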

I ran iozone on this VM on datastore 2 (SSD) and then on datastore 1 (NVMe) without noticing any errors.

I then migrated the VM to the first ESXi host, and immediately iozone ran into errors.

Migrated to esxi2, and immediately errors.

I then migrated back to esxi3, and no errors.

So I have 3 NUCs with 'identical' hardware, apart from the NVMe (the SSDs are identical). In theory the motherboard etc. should be identical (same model NUC, but sometimes manufacturers do change components).

In theory the /installed/ software is also identical. The first two NUCs, however, did have autopartitionsize changed (to a 40GB OS partition for ESXi), as they have smaller disks (250GB compared to a 1TB NVMe for the third).

Thoughts on a good way of fully getting the inventory of the hardware and software to compare/contrast?

gglaccum
Contributor

esxcfg-info dumps a load of info, not all of it particularly targeted.
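
For a more targeted comparison I'm leaning towards dumping a handful of esxcli outputs per host and diffing them, for example:

vmware -vl                        > /tmp/$(hostname)-inventory.txt
esxcli hardware platform get     >> /tmp/$(hostname)-inventory.txt
esxcli storage core adapter list >> /tmp/$(hostname)-inventory.txt
esxcli storage core device list  >> /tmp/$(hostname)-inventory.txt
esxcli system module list        >> /tmp/$(hostname)-inventory.txt
esxcli software vib list         >> /tmp/$(hostname)-inventory.txt
# then copy the files off the hosts and diff them side by side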

One item which did pop out: the two 'dodgy' ones have the vFAT module loaded, while the stable one has vFLASH.

However, I am rapidly becoming of the opinion that there are two issues.

There is an annoying 'error' message which I think is basically a warning (for example, there used to be similar warnings when SATA disks first came out, because SMART programs tried to ask the wrong thing). When I am able to get both disks working, I am getting 1.5 GB/s sustained transfers to the NVMe and 500 MB/s sustained to the SSDs.

This doesn't mean it doesn't need to be investigated and fixed, but for a home lab it reduces the priority a tad.

Secondly, I think that there is a manufacturing/assembly issue. I was getting disk errors etc. which, if a system is sensitive, would cause the disk to be marked read-only/offline (I suspect that vSAN is sensitive).

These errors, however, follow the SATA cable. Mixing and moving disks and cables through the systems, the fault is reproducible 100% of the time with two of the cables I have, and (if the third is bent in the wrong direction) reproducible on that one too.

Looking at the cables, there is a sharp bend where the cable goes into the SATA socket, and I suspect this is an area of weakness. In theory it would be safe within the chassis, but I suspect it was knocked when the systems were assembled, and on at least one of my cables the protective shield has been stripped away in this area.

What is interesting is that, for one cable, these errors on the AHCI chip also appear to affect the NVMe in some cases, so it is perhaps a ground channel that is being affected most. Also, on one of the cables these errors only occur when writing to sectors of the disk above a particular value, so it might be a data channel in that case.

Unfortunately, it appears that there is nowhere these cables can be bought as spares at the moment, so it is a case of complaining to the supplier, and ultimately a support call to Intel, it seems.
