Contributor

ESXi drops / loses VMFS-partition

Hi

I'm experimenting with an ESXi 6.5 installation on an Intel NUC6I5SYH for a lab environment. I'm aware of the "no official support", but please hear me out 🙂

As stated, I'm installing ESXi on an Intel NUC which has an Intel 600p NVMe SSD installed. For the most part everything works fine, but in the last month I have twice seen the partitions on the SSD disappear from ESXi. A simple reboot of the device brings everything back to normal, but while the content of the SSD is inaccessible, the VMs are (of course) not responding.

I can, however, log on to the web interface of ESXi 6.5, and from there I see that the SSD is still recognized (I can see the make and model of the SSD), but the capacity is "0 bytes". If I log on to the ESXi host via SSH and run "df -h", I see two partitions: one which is around 4 GB and one which is 0 bytes. This makes me think that the SSD is not totally dead.

Even though VMware does not support this setup, I wonder what my next troubleshooting step should be. Does the ESXi installation have a CLI command to read out SMART data or "rescan" the SSD for partitions? I'm looking for something to tell me whether I should RMA the SSD, RMA the NUC, or just give up on ESXi in this setup.

I don't really have any logs about the incident since ESXi doesn't have anywhere to write the logs to when this problem occurs.

Thanks!

0 Kudos
9 Replies
Contributor

Hi,

I have exactly the same problem with my ASRock Beebox with an Intel 600p NVMe SSD. Every time it happens, the SSD shows as "0 bytes" and I am forced to reboot ESXi.

I have tried a firmware upgrade for the Intel 600p SSD, but it did not help.

Has anyone had the same problem and found a solution?

0 Kudos
Enthusiast

Hi,

You can list all your storage devices over SSH on the ESXi host:

# esxcli storage core device list

Thank you,

Olivier

Please, visit my blog http://www.purplescreen.eu/
0 Kudos
Contributor

Hi,

I wonder whether it is a bug in the "nvme" driver.

Anyway, my "esxcli storage core device list" output is below:

t10.NVMe____INTEL_SSDPEKKW512G7_____________________BTPY631307NR512F____00000001
   Display Name: Local NVMe Disk (t10.NVMe____INTEL_SSDPEKKW512G7_____________________BTPY631307NR512F____00000001)
   Has Settable Display Name: true
   Size: 488386
   Device Type: Direct-Access
   Multipath Plugin: NMP
   Devfs Path: /vmfs/devices/disks/t10.NVMe____INTEL_SSDPEKKW512G7_____________________BTPY631307NR512F____00000001
   Vendor: NVMe
   Model: INTEL SSDPEKKW51
   Revision:  PSF
   SCSI Level: 6
   Is Pseudo: false
   Status: on
   Is RDM Capable: false
   Is Local: true
   Is Removable: false
   Is SSD: true
   Is VVOL PE: false
   Is Offline: false
   Is Perennially Reserved: false
   Queue Full Sample Size: 0
   Queue Full Threshold: 0
   Thin Provisioning Status: yes
   Attached Filters:
   VAAI Status: unknown
   Other UIDs: vml.0100000000425450593633313330374e523531324620202020494e54454c20
   Is Shared Clusterwide: false
   Is Local SAS Device: false
   Is SAS: false
   Is USB: false
   Is Boot USB Device: false
   Is Boot Device: true
   Device Max Queue Depth: 256
   No of outstanding IOs with competing worlds: 32
   Drive Type: unknown
   RAID Level: unknown
   Number of Physical Drives: unknown
   Protection Enabled: false
   PI Activated: false
   PI Type: 0
   PI Protection Mask: NO PROTECTION
   Supported Guard Types: NO GUARD SUPPORT
   DIX Enabled: false
   DIX Guard Type: NO GUARD SUPPORT
   Emulated DIX/DIF Enabled: false

0 Kudos
Contributor

I have the same problem too.

During a large data copy from the NVMe drive to an HDD, it just drops the partition, and a reboot fixes it.

I wonder if the NVMe drive is overheating and that causes the partition to drop?

0 Kudos
Contributor

Hi,

I am having the exact same issue with the Intel 600P and ESXi 6.5 U1 running on a SuperMicro SYS-5028D-TN4T. It seems to work fine until I try to provision a VM, and then I get an error message that the connection to the datastore has been lost. I have updated to the latest Intel 600P firmware. The output of esxcli storage core device list is as follows:

[root@pESXi-01:~] esxcli storage core device list
t10.NVMe____INTEL_SSDPEKKW010T7_____________________BTPY65320GA71P0H____00000001
   Display Name: Local NVMe Disk (t10.NVMe____INTEL_SSDPEKKW010T7_____________________BTPY65320GA71P0H____00000001)
   Has Settable Display Name: true
   Size: 976762
   Device Type: Direct-Access
   Multipath Plugin: NMP
   Devfs Path:
   Vendor: NVMe
   Model: INTEL SSDPEKKW01
   Revision:  PSF
   SCSI Level: 6
   Is Pseudo: false
   Status: not connected
   Is RDM Capable: false
   Is Local: true
   Is Removable: false
   Is SSD: true
   Is VVOL PE: false
   Is Offline: false
   Is Perennially Reserved: false
   Queue Full Sample Size: 0
   Queue Full Threshold: 0
   Thin Provisioning Status: yes
   Attached Filters:
   VAAI Status: unsupported
   Other UIDs: vml.01000000004254505936353332304741373150304820202020494e54454c20
   Is Shared Clusterwide: false
   Is Local SAS Device: false
   Is SAS: false
   Is USB: false
   Is Boot USB Device: false
   Is Boot Device: false
   Device Max Queue Depth: 256
   No of outstanding IOs with competing worlds: 32
   Drive Type: unknown
   RAID Level: unknown
   Number of Physical Drives: unknown
   Protection Enabled: false
   PI Activated: false
   PI Type: 0
   PI Protection Mask: NO PROTECTION
   Supported Guard Types: NO GUARD SUPPORT
   DIX Enabled: false
   DIX Guard Type: NO GUARD SUPPORT
   Emulated DIX/DIF Enabled: false

I would be extremely grateful if someone has found a fix and can share it.
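
The two listings in this thread differ in the Status field ("on" versus "not connected"), so a dropped device could in principle be spotted from a script. The following is only a rough sketch, assuming the output layout shown above (an unindented device identifier followed by indented "Key: value" lines). The function name offline_devices is hypothetical and not part of esxcli.

```shell
#!/bin/sh
# Sketch: read `esxcli storage core device list` text on stdin and print
# every device whose Status field is anything other than "on".
# Assumption: each device block starts with an unindented identifier line
# and contains one indented "Status: <value>" line, as in this thread.
offline_devices() {
    awk '
        # An unindented, non-empty line is taken to be the device identifier.
        /^[^ \t]/ { dev = $0 }
        # Only the line that begins with "Status:" after the indentation is
        # checked, so "Thin Provisioning Status" and "VAAI Status" are skipped.
        /^ +Status:/ {
            sub(/^ +Status: */, "")
            if ($0 != "on") print dev ": " $0
        }
    '
}
```

On the host it might be used as `esxcli storage core device list | offline_devices`. A line of output would then mean that at least one device is no longer reported as "on".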

0 Kudos
Contributor

My problem was fixed after I attached a small heatsink to the controller.

I suggest you check your temperature by running the following command:

esxcli storage core device list | grep '  Display Name:' | cut -d'(' -f2 | cut -d')' -f1 | while read -r DISK
do
   echo "********** $DISK **********"
   esxcli storage core device smart get -d "$DISK"
done
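
To turn the SMART output into a simple pass/fail check, the temperature row can be pulled out and compared against a limit. This is a minimal sketch, assuming that `esxcli storage core device smart get` prints a table with a row named "Drive Temperature" whose first value column is the current reading. The function names drive_temp and check_temp and the default limit of 50 are hypothetical choices, not vendor thresholds.

```shell
#!/bin/sh
# Sketch: extract the current temperature from SMART table text on stdin.
# Assumed row layout: "Drive Temperature  <Value>  <Threshold>  <Worst>".
drive_temp() {
    # "Drive" and "Temperature" are fields 1 and 2, so the value is field 3.
    awk '/^Drive Temperature/ { print $3; exit }'
}

# Compare the reading against a limit (default 50, an arbitrary example).
# Returns 0 when below the limit, 1 when at or above it, 2 when no number
# was reported (for example "N/A").
check_temp() {
    disk=$1
    limit=${2:-50}
    temp=$(drive_temp)
    case $temp in
        ''|*[!0-9]*)
            echo "$disk: temperature not reported"
            return 2
            ;;
    esac
    if [ "$temp" -ge "$limit" ]; then
        echo "$disk: WARNING $temp >= $limit"
        return 1
    fi
    echo "$disk: OK $temp"
}
```

Inside the loop above, the SMART call could be piped into it, for example `esxcli storage core device smart get -d "$DISK" | check_temp "$DISK" 50`.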

0 Kudos
Contributor

Thanks ivanyeung510,

It is definitely a heat-related issue. I had the fan set to Optimal speed and have had to put it on full speed to keep the drive working, which unfortunately is significantly noisier. It looks like I will need to add a heatsink myself to allow for the quieter fan setting.

I ran the command that you provided, and the temperature when I started to have issues was only 45 degrees, which surprised me. I thought the drive would have a higher threshold before I started to see issues.

0 Kudos
Contributor

nvme_overheat.jpg

I have seen it reach 70 degrees.

After installing a small heatsink, the maximum temperature stays below 50 degrees.

0 Kudos
Contributor

Hi,

I have exactly the same problem with my ASRock Beebox with an Intel 600p NVMe SSD. Every time it happens, the SSD shows as "0 bytes" and I am forced to reboot ESXi.

I have tried a firmware upgrade for the Intel 600p SSD, but it did not help.

Have you found any solution to this issue?

0 Kudos