VMware Cloud Community
vmmpa
Contributor

NVMe health monitoring

Hi,

I'm using a Samsung PM1725b NVMe on ESXi 7.0 and wonder what people are using to

- monitor the health (TBW, errors, temperature)

- predict failures based on these (few) data points

For a regular SSD, I get a lot of information when using

# esxcli storage core device smart get -d ID

Parameter                          Value  Threshold  Worst  Raw
Health Status                      OK     N/A        N/A    N/A
Media Wearout Indicator            99     5          99     172
Write Error Count                  100    10         100    0
Power-on Hours                     92     0          92     151
Power Cycle Count                  99     0          99     14
Reallocated Sector Count           100    10         100    0
Drive Temperature                  69     0          63     31
Write Sectors TOT Count            99     0          99     39
Read Sectors TOT Count             99     0          99     40
Initial Bad Block Count            100    10         100    0
Program Fail Count                 100    10         100    0
Erase Fail Count                   100    10         100    0
Uncorrectable Error Count          100    0          100    0
Pending Sector Reallocation Count  100    0          100    0

For the NVMe i only have this:

Parameter                 Value  Threshold  Worst  Raw
Health Status             OK     N/A        N/A    N/A
Power-on Hours            1677   N/A        N/A    N/A
Power Cycle Count         3      N/A        N/A    N/A
Reallocated Sector Count  0      90         N/A    N/A
Drive Temperature         36     79         N/A    N/A

There have been some efforts to get smartctl up and running, but everything is unofficial:

https://www.virten.net/2016/05/determine-tbw-from-ssds-with-s-m-a-r-t-values-in-esxi-smartctl/

Thanks for any info.

     -Mark

7 Replies
bmrkmr
Enthusiast
(Accepted solution)

This may get you started:

esxcli nvme device list

esxcli nvme device log smart get -A vmhba0

esxcli nvme device log error get -e 64 -A vmhba0
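If you want to cover every NVMe controller rather than hard-coding vmhba0, a minimal sketch along these lines should work (assuming the first column of "esxcli nvme device list" is the HBA name and the first two lines are the header and separator; adjust the awk expression if your build's output differs):

# Loop over all NVMe adapters reported by the host and dump each SMART log page.
# Assumption: column 1 of "esxcli nvme device list" is the HBA name (e.g. vmhba2).
for hba in $(esxcli nvme device list | awk 'NR>2 {print $1}'); do
    echo "=== ${hba} ==="
    esxcli nvme device log smart get -A "${hba}"
done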

vmmpa
Contributor

Yeah, nice. I can feed that to the monitoring system.
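In case it helps anyone doing the same, here is a rough, hypothetical sketch of a cron-able snippet that appends the SMART log page to a file an external monitoring agent can scrape; the adapter name vmhba1 and the /scratch path are placeholders, not values from this thread:

# Hypothetical example: append a timestamped copy of the NVMe SMART log page
# to a file under /scratch for an external monitoring agent to pick up.
# "vmhba1" and LOG are placeholders -- substitute your own adapter and path.
LOG=/scratch/log/nvme-smart.log
echo "### $(date -u +%Y-%m-%dT%H:%M:%SZ) vmhba1" >> "${LOG}"
esxcli nvme device log smart get -A vmhba1 >> "${LOG}"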

ChristoBresler
Contributor

I'm wondering if you can help me with a similar issue.

I have two 1.6 TB Dell NVMe drives in the R740 server that we use.

We recently started having issues with them and got them replaced under warranty.

However, these "new" drives seem to display the exact same errors as the old ones in the VMware environment.

When I sent the iDRAC logs to Dell, they said they can't see any issues with the drives.

When I run the S.M.A.R.T. command I get the following output:

[root@esxi-r740:~] esxcli storage core device smart get -d t10.NVMe____Dell_Ent_NVMe_AGN_MU_AIC_1.6TB__________920800019C382500
Parameter                 Value     Threshold  Worst  Raw
------------------------  --------  ---------  -----  ---
Health Status             OK        N/A        N/A    N/A
Power-on Hours            59        N/A        N/A    N/A
Power Cycle Count         2         N/A        N/A    N/A
Reallocated Sector Count  0         90         N/A    N/A
Drive Temperature         33        70         N/A    N/A
Write Sectors TOT Count   1334000   N/A        N/A    N/A
Read Sectors TOT Count    44260000  N/A        N/A    N/A

[root@esxi-r740:~] esxcli storage core device smart get -d t10.NVMe____Dell_Express_Flash_PM1725b_1.6TB_AIC____332A000121382500
Parameter                 Value     Threshold  Worst  Raw
------------------------  --------  ---------  -----  ---
Health Status             OK        N/A        N/A    N/A
Power-on Hours            59        N/A        N/A    N/A
Power Cycle Count         3         N/A        N/A    N/A
Reallocated Sector Count  0         90         N/A    N/A
Drive Temperature         39        70         N/A    N/A
Write Sectors TOT Count   76960000  N/A        N/A    N/A
Read Sectors TOT Count    93739000  N/A        N/A    N/A

 

We get two warning messages in the syslog, one for each drive:

2021-06-21T08:12:28Z smartd: [warn] t10.NVMe____Dell_Express_Flash_PM1725b_1.6TB_AIC____332A000121382500: REALLOCATED SECTOR CT below threshold (0 < 90)


2021-06-21T08:12:28Z smartd: [warn] t10.NVMe____Dell_Ent_NVMe_AGN_MU_AIC_1.6TB__________920800019C382500: REALLOCATED SECTOR CT below threshold (0 < 90)

 

From what I can find, I understand the S.M.A.R.T. values are normalized, so higher is better. However, in my output the current value for Reallocated Sector Count is 0 and the threshold is 90.
From that it would make sense to conclude that, since the values are normalized, a current value of 0 is very bad and is indeed lower than its threshold of 90 (considering it possibly started at about 200).

However, when we installed the out-of-the-box drives from Dell, we immediately got the same error messages.
When I check the drives, I get the same values as the old ones. Both sets of drives display exactly the same warning in the syslog, apart from the drive name:

2021-06-21T08:12:28Z smartd: [warn] (Drive name) REALLOCATED SECTOR CT below threshold (0 < 90)


I don't understand. Am I reading the values wrong? Is it possible for both the old set of drives and the brand-new set to have the same reallocated sector count of 0? Is it possible the count starts at 0 and goes up, and only once it passes the threshold is there an issue? If that is the case, why do we keep getting the threshold warning in the syslog files?

The only explanation I have so far is that Dell is known to refurbish parts and then resell them or use them for warranty replacements (they take your old parts, fix them, and send them elsewhere). If all the values are normalized and higher is indeed better, maybe Dell gave me drives that have been refurbished and whose reallocated sector count is also 0. However, I need to understand this better, and I will then need to inform Dell of the findings, as they seem to think there are no issues with the two new drives.

Any assistance would be appreciated. 

Thanks 

Regards
daniel521
Contributor

I was running into the same problem and found that there is a different ESXi command for viewing the SMART values for NVMe drives.

esxcli nvme device log smart get -A [nvme_adapter]

See this KB article for details: https://kb.vmware.com/s/article/83150

To get the nvme_adapter value for your system, you can run the following and look for the "Adapter" field.

esxcli storage core path list
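For example, a quick way to pull out just the unique adapter names from that listing (this assumes each path block contains a line of the form "Adapter: vmhbaN", which may vary by build):

# List the unique adapter names referenced by the storage paths.
# Assumption: each path entry includes a line like "   Adapter: vmhba2".
esxcli storage core path list | awk '/^ *Adapter: /{print $2}' | sort -u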

The ESXi monitoring that generates the logs appears to be using the spinning-disk SMART command on the NVMe drives, causing these erroneous warnings to be displayed.

BarryGrowler
Enthusiast

I would recommend considering Nakivo's monitoring tool to help track NVMe health and performance. While ESXi provides basic NVMe telemetry, there are clear gaps that Nakivo can help fill. Its dashboards aggregate system-wide VM storage latency, IOPS, and capacity metrics that offer clues to underlying NVMe problems before they cascade.

depping
Leadership


@BarryGrowler wrote:

I would recommend considering Nakivo's monitoring tool to help track NVMe health and performance. While ESXi provides basic NVMe telemetry, there are clear gaps that Nakivo can help fill. Its dashboards aggregate system-wide VM storage latency, IOPS, and capacity metrics that offer clues to underlying NVMe problems before they cascade.


Dear Barry, please stop posting only about Nakivo; you clearly have some kind of affiliation. This is the VMware Community forum, not the "let's point everyone to Nakivo" forum.

JuliaVMTN
Community Manager

Hi Barry, 

 

This is a reminder that our community guidelines strictly prohibit directing VMware Community users to other companies' products or personal businesses. This guideline is in place to ensure that discussions remain focused on VMware-related topics and to uphold the integrity of our community.

We kindly request your cooperation in refraining from such actions moving forward. If you have any questions about our community guidelines or need further clarification, please don't hesitate to reach out to me or another moderator.

Thank you for your understanding and continued participation in the VMware Community.

Best regards,

 

Julia 
