microlytix
Enthusiast

Strange device controller phenomenon after update to ESXi 7.0 U1d

I've just updated the ESXi hosts of my vSAN cluster from 7.0 U1c to 7.0 U1d.

After reboot my cache device (Optane P4801X 100GB, on HCL) 'moved' from vmhba3 to vmhba64. Also the interface type changed from PCIE to SCSI and the controller name from "NVMe Datacenter SSD (Optane)" to "NVM Express Optane 4800X". There is no longer a PCI ID.
The attached image shows hosts before the patch (U1c, green) and hosts after the latest patch (U1d, yellow).
The disk device (Optane P4801X 100GB) is listed as supported on the HCL.
System is a Supermicro E300-9D-8CN8TP
Has anybody seen something similar?

blog: https://www.elasticsky.de/en
12 Replies
depping
Leadership

moved to vSAN, as some of the other vSAN users (or support people) may have witnessed it.

Lalegre
Virtuoso

Hey @microlytix,

I am not saying this is the solution, but your issue definitely matches parts of the following KB. I think it is also related to the driver that ships with the new version, though that should not be a problem in itself: https://kb.vmware.com/s/article/2127274

paudieo
VMware Employee

Hi

I recall having seen this behaviour once on older 7.0 builds with vSAN-qualified P4800 series devices.
Could you post the precise driver and firmware versions you are using for this device, please?

You may want to file an SR with support and highlight that all the PCI IDs report zeros after the upgrade.

Has this negatively impacted anything since the upgrade or does this appear to be a display issue?

 

TheBobkin
VMware Employee

@microlytix, we have seen this a handful of times in GSS and, IIRC (I never had a case myself), it was attributable to two sets of drivers each identifying a path to the device, which leaves extraneous paths with zeroed IDs (e.g. the vmhba3 alias is still there with the correct IDs, but there is also a blank vmhba64, which is what vSAN Health picks up). I think this was remediated, as @Lalegre mentioned, by removing the extraneous unused paths/aliases. However, the procedure is not the same in 7.0: while esx.conf still exists, it no longer stores PCI device aliases, as these are now stored in ConfigStore. It should just be a case of removing them from there, but the process in that KB isn't going to work in 7.0.

If you can open a Support Request, we can likely do this for you. Otherwise I will have to check whether this process is already documented in an internal KB (ikb): if it is, whether we can make it public, and if not, I will aim to write a KB with the process (provided manual modification of ConfigStore is something we can publish externally).

a_p_
Leadership

>>> provided manual modification of ConfigStore is something we can publish externally
Some KB articles for how to modify the ConfigStore have already been published, so it may just be a matter of support.
Anyway, with an active support contract, I'd recommend opening a support case, especially if this is a production system. This may not only help solve the issue, but also help the developers identify and fix the bug that's causing it.

To find out about the current device configuration, run the following command:
configstorecli config current get -c esx -g system -k device_data

André

 

TheBobkin
VMware Employee

@a_p_, Thanks as always for your (always useful) input.

Yes, I had a look at what is currently published in this area and could only find this single KB: https://kb.vmware.com/s/article/81722.

When making KBs publicly available, we (and anyone, really) should carefully consider the worst possible outcome if someone who doesn't fully comprehend the impact of the changes makes them, or makes them incorrectly. Where specifics need to be targeted (e.g. using specific ConfigStore IDs as opposed to --all, as per the second option in that KB), or where getting the correct syntax is not straightforward, this can result in the knowledge remaining internal as ikbs. Don't get me wrong, I am all for sharing as much understanding as possible, but there are lines and these can be hazy.

While I can't state any date/release, from what I have read (just now, as it had been a long time since I last looked at the relevant PRs), the source of this issue appears to have been identified and resolved in an upcoming release.

But for now I would advise anyone encountering this to open a case with us - I will see what I can do about a kb and update here if this is possible.

microlytix
Enthusiast

thanks Duncan!

I wasn't sure where to post it, because it's host hardware and vSAN.

microlytix
Enthusiast

Thank you all.

I'd like to provide some more information. I've looked at two of my hosts:

esx01 (updated to v7U1d) and esx02 (not updated, v7U1c)

First I looked at the two host clients. In both cases the device ID is 0000:65:00.0

Only the name has changed from "NVMe Datacenter SSD [Optane]" (before) to "NVM Express Optane 4800X"

Then I checked on the CLI:

[root@esx01:~] vmkchdev -l | grep vmhba
0000:00:11.5 8086:a1d2 15d9:0986 vmkernel vmhba0
0000:00:17.0 8086:a182 15d9:0986 vmkernel vmhba1
0000:65:00.0 8086:2701 8086:3907 vmkernel vmhba3
0000:66:00.0 144d:a808 144d:a801 vmkernel vmhba2

[root@esx02:~] vmkchdev -l | grep vmhba
0000:00:11.5 8086:a1d2 15d9:0986 vmkernel vmhba0
0000:00:17.0 8086:a182 15d9:0986 vmkernel vmhba1
0000:65:00.0 8086:2701 8086:3907 vmkernel vmhba3
0000:66:00.0 144d:a808 144d:a801 vmkernel vmhba2

Interestingly, the original vmhba3 number has been kept here on the updated host (esx01).

Let's look at the drivers:

[root@esx01:~] esxcli storage core adapter list
HBA Name  Driver          Link State  UID            Capabilities  Description
--------  --------------  ----------  -------------  ------------  -----------
vmhba0    vmw_ahci        link-n/a    sata.vmhba0                  (0000:00:11.5) Intel Corporation Lewisburg SATA AHCI Controller
vmhba1    vmw_ahci        link-n/a    sata.vmhba1                  (0000:00:17.0) Intel Corporation Lewisburg SATA AHCI Controller
vmhba2    nvme_pcie       link-n/a    pcie.6600                    (0000:66:00.0) Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
vmhba64   intel-nvme-vmd  link-n/a    pscsi.vmhba64                (0000:65:00.0) Intel Corporation NVM Express Optane 4800X

[root@esx02:~] esxcli storage core adapter list
HBA Name  Driver     Link State  UID          Capabilities  Description
--------  ---------  ----------  -----------  ------------  -----------
vmhba0    vmw_ahci   link-n/a    sata.vmhba0                (0000:00:11.5) Intel Corporation Lewisburg SATA AHCI Controller
vmhba1    vmw_ahci   link-n/a    sata.vmhba1                (0000:00:17.0) Intel Corporation Lewisburg SATA AHCI Controller
vmhba2    nvme_pcie  link-n/a    pcie.6600                  (0000:66:00.0) Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
vmhba3    nvme_pcie  link-n/a    pcie.6500                  (0000:65:00.0) Intel Corporation NVMe Datacenter SSD [Optane]

Here we can see that the updated host uses a different driver (intel-nvme-vmd instead of nvme_pcie) for the Optane device, and that the device has a new UID (pscsi.vmhba64).
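
For anyone who wants to spot this kind of change across more hosts, the two listings can be compared per PCI address. This is only a minimal Python sketch, not an official tool: the sample rows are copied from the output above, the helper names (parse_adapters, diff_adapters) are made up for illustration, and the row pattern assumes the Capabilities column is empty, as it is on these hosts.

```python
import re

# Adapter rows copied from the two hosts in this thread (header lines omitted).
ESX01_U1D = """\
vmhba0 vmw_ahci link-n/a sata.vmhba0 (0000:00:11.5) Intel Corporation Lewisburg SATA AHCI Controller
vmhba1 vmw_ahci link-n/a sata.vmhba1 (0000:00:17.0) Intel Corporation Lewisburg SATA AHCI Controller
vmhba2 nvme_pcie link-n/a pcie.6600 (0000:66:00.0) Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
vmhba64 intel-nvme-vmd link-n/a pscsi.vmhba64 (0000:65:00.0) Intel Corporation NVM Express Optane 4800X
"""

ESX02_U1C = """\
vmhba0 vmw_ahci link-n/a sata.vmhba0 (0000:00:11.5) Intel Corporation Lewisburg SATA AHCI Controller
vmhba1 vmw_ahci link-n/a sata.vmhba1 (0000:00:17.0) Intel Corporation Lewisburg SATA AHCI Controller
vmhba2 nvme_pcie link-n/a pcie.6600 (0000:66:00.0) Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
vmhba3 nvme_pcie link-n/a pcie.6500 (0000:65:00.0) Intel Corporation NVMe Datacenter SSD [Optane]
"""

# Assumed row layout: name, driver, link state, UID, "(pci address)", description.
ROW = re.compile(
    r"^(?P<hba>vmhba\d+)\s+(?P<driver>\S+)\s+(?P<link>\S+)\s+(?P<uid>\S+)"
    r"\s+\((?P<pci>[0-9a-fA-F:.]+)\)\s+(?P<desc>.+)$"
)


def parse_adapters(text):
    """Map PCI address -> (hba name, driver, uid) for every adapter row."""
    adapters = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            adapters[match.group("pci")] = (
                match.group("hba"),
                match.group("driver"),
                match.group("uid"),
            )
    return adapters


def diff_adapters(before, after):
    """Return {pci: (before, after)} for addresses whose row differs."""
    return {
        pci: (before[pci], after[pci])
        for pci in before
        if pci in after and before[pci] != after[pci]
    }


changes = diff_adapters(parse_adapters(ESX02_U1C), parse_adapters(ESX01_U1D))
for pci, (old, new) in changes.items():
    print(pci, old, "->", new)
```

With the rows from this thread, only 0000:65:00.0 (the Optane) is reported as changed.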

Let's get some driver details:

[root@esx01:/proc] vmkload_mod -s intel-nvme-vmd
vmkload_mod module information
input file: /usr/lib/vmware/vmkmod/intel-nvme-vmd
Version: 2.0.0.1146-1OEM.700.1.0.15843807
Build Type: release
License: BSD
Required name-spaces:
com.vmware.vmkapi#v2_6_0_0
Parameters:
SNT_COMPAT: bool
SCSI-to-NVMe Compatibility mode. Set to false to use VMware non-compliant translations

[root@esx02:~] vmkload_mod -s nvme_pcie
vmkload_mod module information
input file: /usr/lib/vmware/vmkmod/nvme_pcie
Version: 1.2.3.9-2vmw.701.0.0.16850804
Build Type: release
License: BSD
Required name-spaces:
com.vmware.nvme#0.0.0.1
com.vmware.vmkapi#v2_7_0_0
Parameters:
nvmePCIEFakeAdminQSize: uint
NVMe PCIe fake ADMIN queue size. 0's based
nvmePCIEDma4KSwitch: int
NVMe PCIe 4k-alignment DMA
nvmePCIEDebugMask: int
NVMe PCIe driver debug mask
nvmePCIELogLevel: int
NVMe PCIe driver log level
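
Since the exact driver versions were asked for earlier in the thread, here is a small Python sketch that pulls the Version line out of the vmkload_mod -s output for both modules. The text is trimmed from the output above and the helper names (module_version, packaging_tag) are hypothetical. Reading the "OEM" or "vmw" tag in the version string as a hint about who packaged the driver is my own assumption, not something confirmed in this thread.

```python
import re

# Trimmed from the vmkload_mod -s output of the two hosts in this thread.
INTEL_NVME_VMD = """\
vmkload_mod module information
input file: /usr/lib/vmware/vmkmod/intel-nvme-vmd
Version: 2.0.0.1146-1OEM.700.1.0.15843807
Build Type: release
License: BSD
"""

NVME_PCIE = """\
vmkload_mod module information
input file: /usr/lib/vmware/vmkmod/nvme_pcie
Version: 1.2.3.9-2vmw.701.0.0.16850804
Build Type: release
License: BSD
"""


def module_version(info_text):
    """Return the value of the 'Version:' line, or None if it is missing."""
    for line in info_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "version":
            return value.strip()
    return None


def packaging_tag(version):
    """Return 'OEM' or 'vmw' if the version string carries that tag.

    Interpreting the tag as the packager of the driver is an assumption.
    """
    match = re.search(r"OEM|vmw", version or "")
    return match.group(0) if match else None


for name, text in (("intel-nvme-vmd", INTEL_NVME_VMD), ("nvme_pcie", NVME_PCIE)):
    version = module_version(text)
    print(name, version, packaging_tag(version))
```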

 

I still don't understand why I get different results for the device on the updated host: one query (vmkchdev) returns vmhba3 and the other (esxcli storage core adapter list) returns vmhba64.
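
The mismatch between the two queries on esx01 can be made explicit by joining both outputs on the PCI address. Again, this is only a sketch in Python with made-up helper names (aliases_from_vmkchdev, aliases_from_adapter_list, mismatches). It assumes vmkchdev rows have exactly five whitespace-separated fields and that the adapter list shows the PCI address in parentheses, as in the output above.

```python
import re

# Both samples are from the updated host (esx01) in this thread.
VMKCHDEV = """\
0000:00:11.5 8086:a1d2 15d9:0986 vmkernel vmhba0
0000:00:17.0 8086:a182 15d9:0986 vmkernel vmhba1
0000:65:00.0 8086:2701 8086:3907 vmkernel vmhba3
0000:66:00.0 144d:a808 144d:a801 vmkernel vmhba2
"""

ADAPTER_LIST = """\
vmhba0 vmw_ahci link-n/a sata.vmhba0 (0000:00:11.5) Intel Corporation Lewisburg SATA AHCI Controller
vmhba1 vmw_ahci link-n/a sata.vmhba1 (0000:00:17.0) Intel Corporation Lewisburg SATA AHCI Controller
vmhba2 nvme_pcie link-n/a pcie.6600 (0000:66:00.0) Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
vmhba64 intel-nvme-vmd link-n/a pscsi.vmhba64 (0000:65:00.0) Intel Corporation NVM Express Optane 4800X
"""

PCI_IN_PARENS = re.compile(
    r"\(([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F])\)"
)


def aliases_from_vmkchdev(text):
    """Map PCI address -> alias from 'vmkchdev -l | grep vmhba' rows."""
    aliases = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 5 and fields[4].startswith("vmhba"):
            aliases[fields[0]] = fields[4]
    return aliases


def aliases_from_adapter_list(text):
    """Map PCI address -> HBA name from 'esxcli storage core adapter list' rows."""
    aliases = {}
    for line in text.splitlines():
        match = PCI_IN_PARENS.search(line)
        if match and line.startswith("vmhba"):
            aliases[match.group(1)] = line.split()[0]
    return aliases


def mismatches(first, second):
    """Return {pci: (alias in first, alias in second)} where the two disagree."""
    return {
        pci: (first[pci], second[pci])
        for pci in sorted(first)
        if pci in second and first[pci] != second[pci]
    }


print(mismatches(aliases_from_vmkchdev(VMKCHDEV),
                 aliases_from_adapter_list(ADAPTER_LIST)))
```

On the esx01 data this reports a single disagreement, for 0000:65:00.0 (vmhba3 versus vmhba64).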

microlytix
Enthusiast

Thank you Andre

Here's the result: (shortened, removed vmnic information)

ESX01: (updated)

configstorecli config current get -c esx -g system -k device_data
[
{
"alias": "vmhba1",
"alias_pending": false,
"bus_address": "p0000:00:17.0",
"bus_type": "pci",
"cs_generated_id": "52 03 4d dc a6 22 ed 78-49 6f 77 ab 8b 96 ee 3d"
},
{
"alias": "vmhba64",
"alias_pending": false,
"bus_address": "pci#s00000007.00#0",
"bus_type": "logical",
"cs_generated_id": "52 2e 5d f7 1f 4f 44 7b-9e e6 90 0b 99 6c 81 2e"
},
{
"alias": "vmhba0",
"alias_pending": false,
"bus_address": "p0000:00:11.5",
"bus_type": "pci",
"cs_generated_id": "52 6a 1c e6 72 64 b0 55-49 61 41 ae 73 17 0a bd"
},
{
"alias": "vmhba3",
"alias_pending": false,
"bus_address": "logical#pci#s00000007.00#0#0",
"bus_type": "logical",
"cs_generated_id": "52 85 e6 b1 ec c7 f6 df-ed 86 95 76 dc 90 44 4b"
},
{
"alias": "vmhba2",
"alias_pending": false,
"bus_address": "s00000001.00",
"bus_type": "pci",
"cs_generated_id": "52 a8 55 e3 54 47 d8 ef-00 de 4c 41 bf fa fb fe"
},
{
"alias": "vmhba2",
"alias_pending": false,
"bus_address": "logical#pci#s00000001.00#0#0",
"bus_type": "logical",
"cs_generated_id": "52 ce 26 f3 a4 41 a5 53-14 b2 76 7b c7 e1 51 0e"
},
{
"alias": "vmhba3",
"alias_pending": false,
"bus_address": "s00000007.00",
"bus_type": "pci",
"cs_generated_id": "52 de 2b 74 02 20 d9 7a-49 2a 03 b5 1a ce 36 2d"
},
{
"alias": "vmhba1",
"alias_pending": false,
"bus_address": "pci#p0000:00:17.0#0",
"bus_type": "logical",
"cs_generated_id": "52 e3 b3 4b 55 78 82 c2-22 39 55 a9 22 16 80 37"
},
{
"alias": "vmhba0",
"alias_pending": false,
"bus_address": "pci#p0000:00:11.5#0",
"bus_type": "logical",
"cs_generated_id": "52 ec 69 08 68 b7 44 c6-6f 59 8d d7 19 f3 c0 83"
}

 

ESX02 (not updated) (also shortened without vmnic information)

configstorecli config current get -c esx -g system -k device_data

[
{
"alias": "vmhba2",
"alias_pending": false,
"bus_address": "s00000005.00",
"bus_type": "pci",
"cs_generated_id": "52 1a 04 5f 6d 26 0c dd-39 36 1b 74 34 17 5a 84"
},

{
"alias": "vmhba0",
"alias_pending": false,
"bus_address": "pci#p0000:00:11.5#0",
"bus_type": "logical",
"cs_generated_id": "52 41 05 11 53 b6 28 70-83 04 27 f4 e1 aa 47 5d"
},
{
"alias": "vmhba1",
"alias_pending": false,
"bus_address": "pci#p0000:00:17.0#0",
"bus_type": "logical",
"cs_generated_id": "52 4a fa 26 d5 01 32 5c-b5 f7 d9 7f d8 c6 98 35"
},
{
"alias": "vmhba3",
"alias_pending": false,
"bus_address": "logical#pci#s00000007.00#0#0",
"bus_type": "logical",
"cs_generated_id": "52 5f 21 f7 78 8a aa ee-89 12 2e a3 14 ea 55 43"
},
{
"alias": "vmhba1",
"alias_pending": false,
"bus_address": "p0000:00:17.0",
"bus_type": "pci",
"cs_generated_id": "52 7f bd 5e 12 86 f1 87-c6 d7 09 0a 4a 7e 09 f8"
},
{
"alias": "vmhba0",
"alias_pending": false,
"bus_address": "p0000:00:11.5",
"bus_type": "pci",
"cs_generated_id": "52 98 57 51 27 11 e9 3f-c0 0f a8 f0 e3 18 56 6c"
},
{
"alias": "vmhba3",
"alias_pending": false,
"bus_address": "s00000007.00",
"bus_type": "pci",
"cs_generated_id": "52 9a c4 7c 67 fc 28 60-68 8b d1 de 3f d0 c0 16"
},
{
"alias": "vmhba2",
"alias_pending": false,
"bus_address": "logical#pci#s00000005.00#0#0",
"bus_type": "logical",
"cs_generated_id": "52 c3 e0 d4 3a 9a e9 51-a7 46 b3 5f 89 07 30 9f"
},

]

TheBobkin
VMware Employee

@microlytix, as I mentioned above, this (vmhba64) is essentially the duplicate entry being picked up:

vmhba64 intel-nvme-vmd link-n/a pscsi.vmhba64 (0000:65:00.0) Intel Corporation NVM Express Optane 4800X

 

This results in two logical mappings for the device (note that all the others have 1:1 pci:logical entries), and both point to the same PCI bus address (pci#s00000007). This is also how one could confirm which entry maps to which vmhba[0-4] if a host had vmhba[64-67]:

"alias": "vmhba64",
"alias_pending": false,
"bus_address": "pci#s00000007.00#0",
"bus_type": "logical",
"cs_generated_id": "52 2e 5d f7 1f 4f 44 7b-9e e6 90 0b 99 6c 81 2e"
},
{
"alias": "vmhba3",
"alias_pending": false,
"bus_address": "logical#pci#s00000007.00#0#0",
"bus_type": "logical",
"cs_generated_id": "52 85 e6 b1 ec c7 f6 df-ed 86 95 76 dc 90 44 4b"
},
{
"alias": "vmhba3",
"alias_pending": false,
"bus_address": "s00000007.00",
"bus_type": "pci",
"cs_generated_id": "52 de 2b 74 02 20 d9 7a-49 2a 03 b5 1a ce 36 2d"
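
The same check can be sketched in Python by grouping the device_data entries by the PCI slot token embedded in bus_address and listing every slot that ends up with more than one alias. This is only an illustration: the entries are taken from the esx01 output posted earlier and trimmed to the three fields used here, the slot pattern is inferred from the strings in this thread rather than from any documented format, and the helper names (aliases_by_slot, conflicting_slots) are made up.

```python
import json
import re
from collections import defaultdict

# Subset of the esx01 device_data output from this thread, reduced to
# alias, bus_address and bus_type.
DEVICE_DATA = json.loads("""
[
  {"alias": "vmhba1",  "bus_address": "p0000:00:17.0",                "bus_type": "pci"},
  {"alias": "vmhba1",  "bus_address": "pci#p0000:00:17.0#0",          "bus_type": "logical"},
  {"alias": "vmhba2",  "bus_address": "s00000001.00",                 "bus_type": "pci"},
  {"alias": "vmhba2",  "bus_address": "logical#pci#s00000001.00#0#0", "bus_type": "logical"},
  {"alias": "vmhba3",  "bus_address": "s00000007.00",                 "bus_type": "pci"},
  {"alias": "vmhba3",  "bus_address": "logical#pci#s00000007.00#0#0", "bus_type": "logical"},
  {"alias": "vmhba64", "bus_address": "pci#s00000007.00#0",           "bus_type": "logical"}
]
""")

# Assumed token shapes, inferred from this thread only:
#   "s00000007.00"   (slot-style address)
#   "p0000:00:17.0"  (segment:bus:device.function style address)
SLOT = re.compile(
    r"(s[0-9a-f]{8}\.[0-9a-f]{2}|p[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])"
)


def aliases_by_slot(entries):
    """Group the aliases of all entries by the PCI slot token in bus_address."""
    slots = defaultdict(set)
    for entry in entries:
        match = SLOT.search(entry["bus_address"])
        if match:
            slots[match.group(1)].add(entry["alias"])
    return dict(slots)


def conflicting_slots(entries):
    """Return {slot: sorted aliases} for slots that carry more than one alias."""
    return {
        slot: sorted(aliases)
        for slot, aliases in aliases_by_slot(entries).items()
        if len(aliases) > 1
    }


print(conflicting_slots(DEVICE_DATA))
```

For the esx01 data the only slot reported is s00000007.00, carrying both vmhba3 and vmhba64. The sketch only reads the data; it does not touch ConfigStore.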

These extraneous listings can be removed with 'configstorecli config current delete' (followed by a reboot), but it needs to be stated once more that caution is advised. I have seen this applied using the cs_generated_id but not the vmhba alias, and I will be brutally honest and say I am currently unaware whether there is a difference (ConfigStore has pretty much been a 99% SysOps thing until recently and I am the vSAN guy 🤓), but I may be able to find out.

microlytix
Enthusiast

@TheBobkin, there's an interesting update on the issue.

As I mentioned above, I reverted all hosts but one to v7U1c.

I left esx01 on v7U1d for research purposes.

Today I upgraded my hosts to v7.0.2:

v7.0.1c -> v7.0.2 : vmhba for Optane still correct (vmhba3)

v7.0.1d -> v7.0.2 : vmhba for Optane remained renumbered (vmhba64)

I solved the problem (kind of) by redeploying esx01 with the new ESXi 7.0.2 image.

I guess there was something between 7.0.1c and 7.0.1d that caused the renumbering of the vmhba. Whatever it was, it's no longer part of 7.0.2.

 

TheBobkin
VMware Employee

@microlytix, while I haven't confirmed it first-hand, I did hear one of my colleagues say today that, in a case he had, this had been resolved in 7.0 U2.

It could be that once the issue occurs and the entry is added to ConfigStore, it persists regardless of updating to a release with the fix.
