
ESXi 6.5U2 - NVMe disk only shows up IF another disk exists


So I've come across a weird issue in my new test system (just a test setup, ESXi 6.5U2, host only, no vCenter):

I have a 2.5" P3700 NVMe disk attached to my Supermicro X10-based system (it's direct-attached via an SM riser card, AOC-2UR8N4-i2XT, which has 4x NVMe ports and connects to an SM NVMe backplane; I don't think this HW is relevant, though).

I had been running the system great for about 2 weeks; I had a datastore created directly on the NVMe disk with several VMs running from it.

Yesterday my power flickered, and when it came back my NVMe datastore was gone, and under Storage -> Adapters the NVMe "HBA" wouldn't show up. I tried removing and reinserting the NVMe disk, trying a different NVMe bay, and shutting down/rebooting the host a few times. Nothing.

However, this entire time I could see the disk via the GUI, under Manage -> Hardware -> PCI Devices (see image), and also via lspci over SSH:

0000:08:00.0 Mass storage controller: Intel Corporation DC P3700 SSD [2.5" SFF]

Assuming the flash was fried or something, I booted the system into an Ubuntu live CD, and under the Disks utility, there was the P3700 disk with its proper 1.6 TB VMFS partition intact.

I then updated/patched from 6.5U2 (May 2018) to 6.5U2 (latest patches, ~Nov 1, 2018). Rebooted; still no NVMe showing up.


I happened to attach a random SATA disk (to a motherboard SATA port), formatted it as a datastore (VMFS), then rebooted, and BOOM, the NVMe datastore was back! If I remove the SATA disk and reboot (so that the NVMe is the only disk attached that ESXi sees), the NVMe again won't appear!

So it seems that as long as I have some other disk attached, my NVMe appears properly.

Any ideas what this is about?

thanks!

Accepted Solution
Well, to totally wrap up this issue: Supermicro support today confirmed to me that they *ARE* able to reproduce this issue, but will not be fixing it (I would understand a bit more if the P3700 weren't such a relevant and popular drive, and still among the fastest with high write endurance):

Hi XXX,

I was able to reproduce the Intel P3700 U.2 NVME issue that you were seeing. Since Intel P3700 NVME is EOL, so we don't plan to debug and fix the issue. If you plan to use Intel NVME for your system, please consider Intel P4600 or the new Intel P4700 series.

I will go ahead close the ticket by the end of the day.

Best Regards
XXX (supermicro support rep name removed)

7 Replies

No ideas?

I have tried updating the FW on the P3700; same result.

I also have a second, different 2.5" NVMe (an HGST), and I'm seeing the exact same issue on that as well.

Booting, shutting down, and booting into Ubuntu, the NVMe always shows up (including showing the full VMFS partition); same with Win 2012 R2 - the drive consistently shows up after reboots/power cycles. So the NVMe issue seems to be isolated to ESXi. Any ideas where I can look?

EDIT: See my reply below with the logs.

thanks


I'm still having this issue, but have the following new info:

(BTW, my exact Supermicro system, a 6028U-E1CNRT+, is on the VMware HCL for 5.5U3 through 6.7U1.)

1 - I've fully wiped the drive (both with "clean all" in 2012 R2, and then by using the Intel SSD Data Center Tool to run this command:

isdct start -intelssd 0 -NVMeFormat LBAformat=0 SecureEraseSetting=0 ProtectionInformation=0 MetaDataSettings=0

(using 512 sector size)

2 - I've tried fresh installs (i.e., current release ISOs from VMware, no config, just install and boot up) of 6.5U2, 6.5U2 (then patched to current patches), and 6.7 (no patches).

With either 6.5U2 install, the issue is at its worst: sometimes a reboot or shutdown will get the drive to show up (but rarely); the only reliable way is by attaching other SATA disks (and even then it's not always immediate - it takes a few reboots/shutdowns).

With 6.7, it's not as bad: if I shut down and then power up, the drive will be gone; however, if I then reboot, it will appear. It's only the first boot-up after a shutdown where it's gone. I can reboot 5 times in a row and it will always appear.

(ALWAYS, under either 6.5 or 6.7, the drive shows up under Manage -> PCI Devices, like in my image above.)

I have tried re-flashing/loading defaults of my SM 3.1 BIOS (current, and what I've always been using); no effect.

(The issue never happens with any other OS - i.e., never with Ubuntu, 2012 R2, or Win 10; the drive always shows up and is accessible.)

EDIT: See reply below with relevant logs.

tks


Vme332c, please share a screenshot of the disk device by going to Host >> Storage >> Devices and selecting the disk in question.

Regards, Pradhuman (VCIX-NV, VCAP-NV, vExpert, VCP2X-DCVNV). If my answer resolved your query, don't forget to mark it as "Correct Answer".

Thanks for your reply. However, the only entry that appears under Devices is my USB boot stick (unless I reboot, and then the NVMe will also appear there). I've attached images. Also, I have everything possible disconnected from this system (all PCIe cards, all disks).

(I have reproduced this issue with 6.5U2 and 6.7, both fully patched and not.)

(Also, my exact Supermicro system, a 6028U-E1CNRT+, is on the VMware HCL for 5.5U3 through 6.7U1 - Product Page LINK.)

(This issue does not occur with any other OS - i.e., never with Ubuntu, 2012 R2, or Win 10; the drive always shows up and is accessible, every time.)

I DID find this, though, which has to be the cause/source (not sure how to address it):

cat /var/log/*.log | grep 0000:06:00.0

(Note: 0000:06:00.0 is the DC P3700 SSD [2.5" SFF], as seen in the Hardware tab of ESXi -> Manage.)

2019-01-13T02:53:11Z shell[2099508]: [root]: cat /var/log/*.log | grep 0000:06:00.0

0:00:00:04.597 cpu0:2097152)VMKAcpi: 1098: Handle already exists in hash table for 0000:06:00.0

0:00:00:08.638 cpu0:2097152)PCI: 2161: 0000:06:00.0: Device is disabled by the BIOS, Command register 0x0

0:00:00:08.638 cpu0:2097152)PCI: 478: 0000:06:00.0: PCIe v2 PCI Express Endpoint

0:00:00:08.638 cpu0:2097152)PCI: 238: 0000:06:00.0: Found support for extended capability 0x1 (Advanced Error Reporting)

0:00:00:08.638 cpu0:2097152)PCI: 238: 0000:06:00.0: Found support for extended capability 0x2 (Virtual Channel)

0:00:00:08.638 cpu0:2097152)PCI: 238: 0000:06:00.0: Found support for extended capability 0x4 (Power Budgeting)

0:00:00:08.638 cpu0:2097152)PCI: 238: 0000:06:00.0: Found support for extended capability 0xe (Alternative Routing-ID Interpretation)

0:00:00:08.638 cpu0:2097152)PCI: 238: 0000:06:00.0: Found support for extended capability 0x3 (Device Serial Number)

0:00:00:08.638 cpu0:2097152)PCI: 238: 0000:06:00.0: Found support for extended capability 0x19 (Secondary PCI Express)

0:00:00:08.638 cpu0:2097152)PCI: 141: Found physical slot 0x7 from ACPI _SUN for 0000:06:00.0

0:00:00:08.638 cpu0:2097152)PCI: 413: 0000:06:00.0: PCIe v2 PCI Express Endpoint

0:00:00:08.638 cpu0:2097152)PCI: 1067: 0000:06:00.0: probing 8086:0953 8086:3703

0:00:00:08.638 cpu0:2097152)PCI: 405: 0000:06:00.0: Adding to resource tracker under parent 0000:00:03.0.

0:00:00:08.638 cpu0:2097152)WARNING: PCI: 453: 0000:06:00.0: Failed to add BAR[0] (MEM64 f=0x4 0x0-0x4000) - out of resources on parent: 0000:00:03.0

0:00:00:08.638 cpu0:2097152)WARNING: PCI: 476: 0000:06:00.0: Failed to add BAR[0] (MEM64 f=0x4 0x0-0x4000) status: Limit exceeded

0:00:00:08.641 cpu0:2097152)PCI: 1624: 0000:06:00.0 8086:0953 8086:3703 unchanged, done with probe-scan phase already.

0:00:00:08.641 cpu0:2097152)PCI: 1282: 0000:06:00.0: registering 8086:0953 8086:3703

0:00:00:08.641 cpu0:2097152)PCI: 1301: 0000:06:00.0 8086:0953 8086:3703 disabled due to insufficient resources orbecause the device is not supported: Not supported

0:00:00:08.641 cpu0:2097152)WARNING: PCI: 679: 0000:06:00.0: Unable to free BAR[0] (MEM64 f=0x4 0x0-0x4000): Limit exceeded

0:00:00:08.638 cpu0:2097152)WARNING: PCI: 453: 0000:06:00.0: Failed to add BAR[0] (MEM64 f=0x4 0x0-0x4000) - out of resources on parent: 0000:00:03.0

0:00:00:08.638 cpu0:2097152)WARNING: PCI: 476: 0000:06:00.0: Failed to add BAR[0] (MEM64 f=0x4 0x0-0x4000) status: Limit exceeded

0:00:00:08.641 cpu0:2097152)WARNING: PCI: 679: 0000:06:00.0: Unable to free BAR[0] (MEM64 f=0x4 0x0-0x4000): Limit exceeded

The above is in contrast to after I reboot the system (which makes the NVMe drives show up); this is what the same output looks like when the NVMe disk is working/appears:

cat /var/log/*.log | grep 0000:06:00.0

2019-01-13T03:06:01Z shell[2099499]: [root]: cat /var/log/*.log | grep 0000:06:00.0

0:00:00:04.590 cpu0:2097152)VMKAcpi: 1098: Handle already exists in hash table for 0000:06:00.0

0:00:00:08.631 cpu0:2097152)PCI: 478: 0000:06:00.0: PCIe v2 PCI Express Endpoint

0:00:00:08.631 cpu0:2097152)PCI: 238: 0000:06:00.0: Found support for extended capability 0x1 (Advanced Error Reporting)

0:00:00:08.631 cpu0:2097152)PCI: 238: 0000:06:00.0: Found support for extended capability 0x2 (Virtual Channel)

0:00:00:08.631 cpu0:2097152)PCI: 238: 0000:06:00.0: Found support for extended capability 0x4 (Power Budgeting)

0:00:00:08.631 cpu0:2097152)PCI: 238: 0000:06:00.0: Found support for extended capability 0xe (Alternative Routing-ID Interpretation)

0:00:00:08.631 cpu0:2097152)PCI: 238: 0000:06:00.0: Found support for extended capability 0x3 (Device Serial Number)

0:00:00:08.631 cpu0:2097152)PCI: 238: 0000:06:00.0: Found support for extended capability 0x19 (Secondary PCI Express)

0:00:00:08.631 cpu0:2097152)PCI: 141: Found physical slot 0x7 from ACPI _SUN for 0000:06:00.0

0:00:00:08.631 cpu0:2097152)PCI: 413: 0000:06:00.0: PCIe v2 PCI Express Endpoint

0:00:00:08.631 cpu0:2097152)PCI: 1067: 0000:06:00.0: probing 8086:0953 8086:3703

0:00:00:08.631 cpu0:2097152)PCI: 405: 0000:06:00.0: Adding to resource tracker under parent 0000:00:03.0.

0:00:00:08.635 cpu0:2097152)PCI: 1624: 0000:06:00.0 8086:0953 8086:3703 unchanged, done with probe-scan phase already.

0:00:00:08.635 cpu0:2097152)PCI: 1282: 0000:06:00.0: registering 8086:0953 8086:3703

2019-01-13T02:56:56.775Z cpu11:2097591)PCI: 1254: 0000:06:00.0 named 'vmhba2' (was '')

2019-01-13T02:56:59.235Z cpu11:2097591)VMK_PCI: 914: device 0000:06:00.0 pciBar 0 bus_addr 0xfb110000 size 0x4000

2019-01-13T02:56:59.235Z cpu11:2097591)VMK_PCI: 764: device 0000:06:00.0 allocated 2 MSIX interrupts

If I really were out of PCIe lanes, or some other resource, then why does it go away after one reboot, and why does it not affect other OSes (bare metal)?
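For anyone comparing their own boots, the failing boot can be spotted mechanically: the tell-tale lines are the "disabled by the BIOS" message and the BAR allocation failures. A minimal sketch (the sample log lines are pasted from the failing boot above; the /tmp filename is just a placeholder - on a live host you'd grep /var/log/vmkernel.log or the concatenated logs instead):

```shell
# Sketch: flag the lines that indicate the NVMe device was dropped at boot.
# Sample input is copied from the failing boot shown above.
cat > /tmp/vmkernel-fail.log <<'EOF'
0:00:00:08.638 cpu0:2097152)PCI: 2161: 0000:06:00.0: Device is disabled by the BIOS, Command register 0x0
0:00:00:08.638 cpu0:2097152)WARNING: PCI: 453: 0000:06:00.0: Failed to add BAR[0] (MEM64 f=0x4 0x0-0x4000) - out of resources on parent: 0000:00:03.0
0:00:00:08.641 cpu0:2097152)PCI: 1301: 0000:06:00.0 8086:0953 8086:3703 disabled due to insufficient resources orbecause the device is not supported: Not supported
EOF

# Count the failure signatures; a bad boot matches all three patterns,
# a good boot (like the working output above) matches none.
grep -cE 'disabled by the BIOS|Failed to add BAR|insufficient resources' /tmp/vmkernel-fail.log
```

On the working boot's log above, the same grep finds zero matches, which lines up with the device getting its BAR (bus_addr 0xfb110000) and registering as vmhba2.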

tks


Did you check whether the BIOS and firmware versions in use are up to date? Please check the HCL and update the firmware of all devices along with the BIOS. I have seen this issue once, and it was fixed with a BIOS and firmware update.

As you have already mentioned, all versions of ESXi show the same behaviour on this machine while other machines work fine, so it's worth trying; if it still doesn't get fixed, then it's worth checking the CPU from a hardware perspective.

Regards, Pradhuman (VCIX-NV, VCAP-NV, vExpert, VCP2X-DCVNV). If my answer resolved your query, don't forget to mark it as "Correct Answer".

Thanks. Just to update: this turned out to be an issue with all 2.5" Intel P3700 NVMe drives and my particular Supermicro system (NVMe backplane + 4x NVMe ports via an AOC). I'm running an entire SM system, all of which is on the HCL for up to 6.7 (the P3700 is also on the HCL, but is not part of the SM system, of course).

While other OSes do not exhibit this same issue, I noticed that on a cold boot the P3700 does not show up in the BIOS, so somehow this affects ESXi but not other OSes.

Also, I've tried other enterprise NVMe drives in the exact same slot/position as the 2.5" P3700, and they do NOT exhibit this issue (i.e., they always show up on cold boot in ESXi).

Also, the PCIe (add-in card, not 2.5") version of the P3700 does NOT exhibit the issue. So it's something with the 2.5" P3700 and/or the motherboard and/or the NVMe AOC. I think it's mostly the P3700, as I found a (very) few reports of a similar issue from people with consumer systems and consumer OSes (the 2.5" P3700 won't show up in the BIOS on a cold boot, so they can't boot an OS from the P3700 unless they warm boot; then they are good).

All FWs are the latest (motherboard, P3700). So for now it's something we just work around.

If anyone has any other info or similar issues, please post here. Thanks.
