ESXi 7.0 M.2 NVMe SSDs failing

NVMEIO · ‎11-01-2021

Hello,
We’re running ESXi-7.0.2-17630552-standard on two machines with ASUSTeK Z11PA-D8 mainboards, each equipped with two M.2 Samsung SSD 980 PRO 2TB (MZ-V8P2T0) drives as local datastores.
When copying VMDK files or creating snapshots of virtual machines, these datastores occasionally become unavailable. The device status of these M.2 SSDs is “Error, Timeout” and a rescan doesn’t help. Only a reboot of the host brings the devices back.
I'd appreciate advice on how to solve this issue.

Log:
2021-10-22T10:31:27.693Z: [APDCorrelator] 822002155us: [vob.storage.apd.start] Device or filesystem with identifier [t10.NVMe____Samsung_SSD_980_PRO_2TB_________________C16DB311B2382500] has entered the All Paths Down state.
2021-10-22T10:31:27.387Z cpu32:2100929)WARNING: NVMEIO:3303 Controller 256 receives async event: type 1, info 1, log page ID 2.
2021-10-22T10:31:27.388Z cpu0:2097827)WARNING: NVMEDEV:7858 Critical warning 0x2 detected, failing controller 256.
2021-10-22T10:31:27.388Z cpu24:2098011)WARNING: NVMEPSA:493 psaCmd 0x45b8e5223200 failed to submit. Controller 256 in state: 9.

sw00b · ‎11-03-2021

Hello,

I have the same issue with an Intel MA2000 M.2 drive running a couple of VMs on ESXi 7.0u3. It becomes unresponsive every 24 to 48 hours.
I have a second, identical drive where I haven't noticed any issue (yet?). It doesn't do a lot of I/O though, just some replication from time to time, no VMs.

Motherboard is MSI Z590 chipset.

Some logfiles:

Hi,

I have the same issue with an Intel AM2000 M.2 drive (1TB) and ESXi 7.0u3.
The host runs fine for 24-48 hours and then the drive suddenly becomes unavailable.

I have a second, identical drive in the same server which hasn't given any errors so far.
That might be related to the fact that the failing drive is hosting the virtual machines and the second one is only used for backup replication, so there is no constant I/O on the second drive.

Below some logs:

HOSTD.LOG

2021-11-03T18:41:22.997Z info hostd[1051749] [Originator@6876 sub=Hostsvc.VmkVprobSource] VmkVprobSource::Post event: (vim.event.EventEx) {
--> key = 189,
--> chainId = 586613328,
--> createdTime = "1970-01-01T00:00:00Z",
--> userName = "",
--> host = (vim.event.HostEventArgument) {
--> name = "localhost.home.int",
--> host = 'vim.HostSystem:ha-host'
--> },
--> ds = (vim.event.DatastoreEventArgument) {
--> name = "VMs (M.2 SLOT 2)",
--> datastore = 'vim.Datastore:617e3cee-3608d327-9016-d8bbc1172c12'
--> },
--> eventTypeId = "esx.problem.vmfs.heartbeat.timedout",
--> arguments = (vmodl.KeyAnyValue) [
--> (vmodl.KeyAnyValue) {
--> key = "1",
--> value = "617e3cee-3608d327-9016-d8bbc1172c12"
--> },
--> (vmodl.KeyAnyValue) {
--> key = "2",
--> value = (vim.event.DatastoreEventArgument) {
--> name = "VMs (M.2 SLOT 2)",
--> datastore = 'vim.Datastore:617e3cee-3608d327-9016-d8bbc1172c12'
--> }
--> ],
--> objectId = "617e3cee-3608d327-9016-d8bbc1172c12",
--> objectType = "vim.Datastore",
--> objectName = "VMs (M.2 SLOT 2)",
--> }

VOBD.LOG

2021-11-01T07:22:47.139Z: [vmfsCorrelator] 47755689635us: [vob.vmfs.heartbeat.timedout] 617e3cee-3608d327-9016-d8bbc1172c12 VMs (M.2 SLOT 2)
2021-11-01T07:22:47.139Z: [vmfsCorrelator] 47755689847us: [esx.problem.vmfs.heartbeat.timedout] 617e3cee-3608d327-9016-d8bbc1172c12 VMs (M.2 SLOT 2)
2021-11-03T18:41:22.997Z: [vmfsCorrelator] 203512516605us: [vob.vmfs.heartbeat.timedout] 617e3cee-3608d327-9016-d8bbc1172c12 VMs (M.2 SLOT 2)
2021-11-03T18:41:22.997Z: [vmfsCorrelator] 203512516794us: [esx.problem.vmfs.heartbeat.timedout] 617e3cee-3608d327-9016-d8bbc1172c12 VMs (M.2 SLOT 2)

VMKERNEL

2021-11-03T18:41:24.798Z cpu11:1052169)HBX: 3058: 'VMs (M.2 SLOT 2)': HB at offset 3899392 - Waiting for timed out HB:
2021-11-03T18:41:30.505Z cpu14:1051741 opID=1d4db708)HBX: 3058: 'VMs (M.2 SLOT 2)': HB at offset 3899392 - Waiting for timed out HB:
2021-11-03T18:41:34.800Z cpu0:1052169)HBX: 3058: 'VMs (M.2 SLOT 2)': HB at offset 3899392 - Waiting for timed out HB:
2021-11-03T18:41:44.801Z cpu0:1052169)HBX: 3058: 'VMs (M.2 SLOT 2)': HB at offset 3899392 - Waiting for timed out HB:

Any help would be appreciated!
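
To check the "every 24 to 48 hours" estimate against the VOBD.LOG entries above, the lines can be parsed into a timestamp and a time-since-boot figure. This is only a rough sketch: the regular expression and the function name `parse_vobd_line` are illustrative, and it assumes the `...us` field is the host uptime in microseconds, which the log itself does not state.

```python
import re
from datetime import datetime, timezone

# Two of the vob.vmfs.heartbeat.timedout lines quoted from VOBD.LOG above.
VOBD_LINES = [
    "2021-11-01T07:22:47.139Z: [vmfsCorrelator] 47755689635us: "
    "[vob.vmfs.heartbeat.timedout] 617e3cee-3608d327-9016-d8bbc1172c12 VMs (M.2 SLOT 2)",
    "2021-11-03T18:41:22.997Z: [vmfsCorrelator] 203512516605us: "
    "[vob.vmfs.heartbeat.timedout] 617e3cee-3608d327-9016-d8bbc1172c12 VMs (M.2 SLOT 2)",
]

# Hypothetical pattern matching the line layout seen in this thread only.
PATTERN = re.compile(
    r"^(?P<ts>\S+Z): \[(?P<source>\w+)\] (?P<us>\d+)us: "
    r"\[(?P<event>[\w.]+)\] (?P<msg>.*)$"
)

def parse_vobd_line(line):
    """Return a dict for one vobd.log line, or None if it does not match."""
    m = PATTERN.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S.%fZ")
    return {
        "time": ts.replace(tzinfo=timezone.utc),
        # Assumption: the "us" counter is microseconds since boot.
        "uptime_hours": int(m["us"]) / 3.6e9,
        "event": m["event"],
        "message": m["msg"],
    }

if __name__ == "__main__":
    for line in VOBD_LINES:
        ev = parse_vobd_line(line)
        print(ev["time"].isoformat(), f"{ev['uptime_hours']:.1f} h", ev["event"])
```

If the uptime assumption holds, the two timeouts happened roughly 13 and 57 hours after boot, and about 59 hours apart in wall-clock time.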

sw00b · ‎11-03-2021

I just moved all VMs to the second NVMe drive. If this one has the same issue, then at least I know drive 1 isn't faulty.

EDV-Schuster · ‎09-20-2022

I have failing Samsung M.2 980 Pro SSDs in my ESXi 7.0 host.

Did you find a solution?

BY3 · ‎09-21-2022

Seems to me like ESXi 7.0.3 is not working well with my NVMe drive either (Samsung 980 Pro SSD 1TB, PCIe Gen 4x4 interface).

If I start moving data around on it, it will usually just stop responding. I have a second M.2 drive from a different manufacturer and it has no issues.

If anyone knows of any solution or has ideas, I would love to hear them. I've been searching the internet and have seen a few others with the same issue, but no solutions other than taking the drive out and replacing it with a different brand.

sisyphus2 · ‎09-21-2022

It's the controller inside the SSD that is overheating under heavy workloads. You could try to improve cooling by installing a heatsink or an active cooler on the SSD.
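
The log in the first post appears consistent with a thermal cause. In the NVMe specification, the Critical Warning byte of the SMART / Health Information log (log page ID 2) uses bit 1 for a temperature threshold violation, and the async event "type 1, info 1" corresponds to a SMART/Health status event for temperature. A minimal Python sketch for decoding the value from such a kernel log line follows; the function names and the regular expression are illustrative only and are not part of any ESXi tool.

```python
import re

# Bit meanings of the Critical Warning byte in the NVMe SMART / Health
# Information log page (log page ID 0x02), as defined by the NVMe base spec.
CRITICAL_WARNING_BITS = {
    0: "available spare capacity below threshold",
    1: "temperature above over-temperature or below under-temperature threshold",
    2: "NVM subsystem reliability degraded",
    3: "media placed in read-only mode",
    4: "volatile memory backup device failed",
    5: "persistent memory region read-only or unreliable",
}

# The vmkernel line quoted in the first post of this thread.
LOG_LINE = (
    "2021-10-22T10:31:27.388Z cpu0:2097827)WARNING: NVMEDEV:7858 "
    "Critical warning 0x2 detected, failing controller 256."
)

def warning_from_log(line):
    """Extract the hex critical-warning value from a log line, or None."""
    m = re.search(r"Critical warning (0x[0-9a-fA-F]+)", line)
    return int(m.group(1), 16) if m else None

def decode_critical_warning(value):
    """Return the description of every bit that is set in the byte."""
    return [text for bit, text in CRITICAL_WARNING_BITS.items()
            if value & (1 << bit)]

if __name__ == "__main__":
    value = warning_from_log(LOG_LINE)
    for reason in decode_critical_warning(value):
        print(f"0x{value:x}: {reason}")
```

For the value 0x2 only the temperature bit is set. That says a temperature threshold was crossed; it does not by itself tell whether the drive was too hot or which sensor triggered it.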

Marc2marc · ‎10-14-2023

Hi,

I've got the same problem with a Samsung 990 Pro NVMe.
Is there a recommendation for an M.2 NVMe SSD without this overheating problem?

My setup is a Supermicro H13SSL-NT mainboard.
Kind regards

Marc