VMware Cloud Community
misan
Enthusiast

smartd Warnings for both NVMe drives ESXi 7.0.0u2

Hi - I'm seeing the following warnings on ESXi 7.0u2 fairly frequently for both of my Samsung 970 EVO Plus 1TB NVMe drives. Both are connected to an AOC-SLG3-2M2 on an X11SDV-8C+-TLN2F board.

smartd: [warn] t10.NVMe____Samsung_SSD_970_EVO_Plus_1TB____________5702921152382500: REALLOCATED SECTOR CT below threshold (0 < 90)

Seems unlikely both drives are faulty...

I presume this is just a notification and can be ignored? The message says the reallocated sector count (0) is below the threshold (90), so presumably the drives are operating normally.

If I run esxcli storage core device smart get -d=t10.NVMe____Samsung_SSD_970_EVO_Plus_1TB____________5702921152382500 for both drives, I see:

Parameter                 Value  Threshold  Worst  Raw
------------------------  -----  ---------  -----  ---
Health Status             OK     N/A        N/A    N/A
Power-on Hours            1382   N/A        N/A    N/A
Power Cycle Count         8      N/A        N/A    N/A
Reallocated Sector Count  0      90         N/A    N/A
Drive Temperature         57     85         N/A    N/A
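
For anyone who wants to check these values from a script rather than by eye, here is a minimal Python sketch that parses the table above. It assumes the output has already been captured as text (for example over SSH) and that the four value columns never contain spaces. `parse_smart_table` and `to_int` are illustrative names, not part of any VMware tooling.

```python
def parse_smart_table(text):
    """Parse `esxcli storage core device smart get` text into a dict of rows."""
    rows = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, the header row, and the dashed separator row.
        if not line or line.startswith("Parameter") or set(line) <= {"-", " "}:
            continue
        # Parameter names contain spaces, so peel the four value columns
        # off the right-hand side instead of splitting from the left.
        parts = line.rsplit(None, 4)
        if len(parts) != 5:
            continue
        name, value, threshold, worst, raw = parts
        rows[name.strip()] = {
            "value": value,
            "threshold": threshold,
            "worst": worst,
            "raw": raw,
        }
    return rows


def to_int(field):
    """Return the field as an int, or None for non-numeric fields such as N/A."""
    return int(field) if field.isdigit() else None


# Output copied from the post above.
SAMPLE = """\
Parameter Value Threshold Worst Raw
------------------------ ----- --------- ----- ---
Health Status OK N/A N/A N/A
Power-on Hours 1382 N/A N/A N/A
Power Cycle Count 8 N/A N/A N/A
Reallocated Sector Count 0 90 N/A N/A
Drive Temperature 57 85 N/A N/A
"""

if __name__ == "__main__":
    for name, cols in parse_smart_table(SAMPLE).items():
        print(name, to_int(cols["value"]), to_int(cols["threshold"]))
```

Splitting from the right keeps multi-word names such as "Reallocated Sector Count" intact, and it works whether the columns are separated by one space or by aligned padding.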

The motherboard is running on the latest BIOS and there are no firmware updates for the AOC-SLG3-2M2.

Many thanks

Chris

 

6 Replies
depping
Leadership

I have seen this reported before. The current value of reallocated sectors is still 0, so I wouldn't be too concerned about it. Let me see if I can find an explanation internally.

depping
Leadership

https://en.wikipedia.org/wiki/S.M.A.R.T.#Known_ATA_S.M.A.R.T._attributes

 

I would probably check whether there's new firmware for the device, as that may resolve the problem. Other than that, I have not seen any internal notes on this issue.

misan
Enthusiast

Hi @depping - Many thanks for the response. I don't believe there is newer firmware available for the drive at present, but I will check.
misan
Enthusiast

Nope - I checked and there is no newer firmware for the drive. The installed revision can be read with:

esxcli nvme device get -A vmhba3 | egrep "Serial Number|Model Number|Firmware Revision"

Perhaps future ESXi releases should log these smartd notifications in syslog.log as [info] rather than [warn] when the values are within normal operating parameters.

Kind Regards

depping
Leadership

Yes, to be honest I'm not sure why this happens. Something triggers the warning, but the counter looks normal to me, which is strange.

4Bob
Contributor

Hi, I also receive this warning on 2 drives:

2022-06-12T16:12:39.546Z smartd[526564]: [warn] t10.NVMe____Samsung_SSD_980_PRO_1TB_________________A0D5B021B3382500: REALLOCATED SECTOR CT below threshold (0 < 90)

2022-06-12T17:44:28.466Z smartd[133353]: [warn] t10.NVMe____WDS500G2X0C2D00L350______________________414786448B441B00: REALLOCATED SECTOR CT below threshold (0 < 90)
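
To see how often each device triggers the message, a rough Python sketch along these lines could tally the smartd warnings from a copy of syslog.log. The regular expression is inferred only from the lines quoted in this thread, so other smartd message formats may not match. `parse_smartd_warning` and `count_by_device` are hypothetical helper names.

```python
import re
from collections import Counter

# Pattern inferred from the warnings quoted in this thread (an assumption,
# not a documented smartd log format). The "[pid]" part is optional.
WARN_RE = re.compile(
    r"smartd(?:\[\d+\])?: \[warn\] (?P<device>\S+): "
    r"(?P<attr>.+?) below threshold \((?P<value>\d+) < (?P<threshold>\d+)\)"
)


def parse_smartd_warning(line):
    """Return the fields of one smartd threshold warning, or None if no match."""
    m = WARN_RE.search(line)
    if m is None:
        return None
    return {
        "device": m.group("device"),
        "attribute": m.group("attr"),
        "value": int(m.group("value")),
        "threshold": int(m.group("threshold")),
    }


def count_by_device(lines):
    """Count matching warnings per device identifier."""
    hits = (parse_smartd_warning(line) for line in lines)
    return Counter(hit["device"] for hit in hits if hit)


# The two warnings quoted above, plus one made-up unrelated line.
SAMPLE_LINES = [
    "2022-06-12T16:12:39.546Z smartd[526564]: [warn] t10.NVMe____Samsung_SSD_980_PRO_1TB_________________A0D5B021B3382500: REALLOCATED SECTOR CT below threshold (0 < 90)",
    "2022-06-12T17:44:28.466Z smartd[133353]: [warn] t10.NVMe____WDS500G2X0C2D00L350______________________414786448B441B00: REALLOCATED SECTOR CT below threshold (0 < 90)",
    "an unrelated log line (hypothetical)",
]

if __name__ == "__main__":
    for device, count in count_by_device(SAMPLE_LINES).items():
        print(count, device)
```

On a real host you would feed it the lines of the log file instead of `SAMPLE_LINES`, for example `count_by_device(open(path))` with `path` pointing at wherever your syslog.log copy lives.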

I found this post describing the same behavior with a supported device, a Dell Express Flash PM1725b 1.6TB: https://communities.vmware.com/t5/ESXi-Discussions/NVMe-health-monitoring/td-p/2312062

"Reallocated Sector Count" is not displayed for my NVMe drive when I use smartctl under Linux; there is no such value under === START OF SMART DATA SECTION ===. Should this parameter be removed from the ESXi output? The drive SMART log and device stats look good in both cases.


[#:~] esxcli nvme device log smart get -A vmhba1
SMART And Health Info:
Available Spare Space Below Threshold: false
Temperature Warning: false
NVM Subsystem Reliability Degradation: false
Read Only Mode: false
Volatile Memory Backup Device Failure: false
Composite Temperature: 319 K
Available Spare: 100 %
Available Spare Threshold: 10 %
Percentage Used: 0 %
Data Units Read: 0x96bca
Data Units Written: 0x22ca6c
Host Read Commands: 0x25fd8e
Host Write Commands: 0xced422
Controller Busy Time: 0xa
Power Cycles: 0x8
Power On Hours: 0x34
Unsafe Shutdowns: 0x2
Media Errors: 0x0
Number of Error Info Log Entries: 0x0
Warning Composite Temperature Time: 0 Mins
Critical Composite Temperature Time: 0 Mins
Temperature Sensor 1: 319 K
Temperature Sensor 2: 330 K
Temperature Sensor 3: 0 K
Temperature Sensor 4: 0 K
Temperature Sensor 5: 0 K
Temperature Sensor 6: 0 K
Temperature Sensor 7: 0 K
Temperature Sensor 8: 0 K
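
The counters in the SMART log above are printed in hexadecimal and the temperatures in Kelvin, which makes them awkward to compare with the esxcli storage core device smart get table. A small Python sketch, assuming the output has been captured as text, can convert them. The helper names are made up for illustration. The 512,000 bytes per data unit figure is taken from the NVMe specification, which reports data units in thousands of 512-byte blocks.

```python
def parse_kv(text):
    """Parse 'Key: Value' lines into a dict, skipping headings with no value."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def hex_field(fields, name):
    """Convert a counter such as '0x34' to an int."""
    return int(fields[name], 16)


def kelvin_to_celsius(field):
    """Convert a field such as '319 K' to degrees Celsius."""
    return round(int(field.split()[0]) - 273.15, 2)


def data_units_to_bytes(units):
    """One NVMe data unit is 1000 blocks of 512 bytes."""
    return units * 1000 * 512


# A trimmed copy of the output above.
SAMPLE = """\
SMART And Health Info:
Composite Temperature: 319 K
Available Spare: 100 %
Available Spare Threshold: 10 %
Percentage Used: 0 %
Data Units Written: 0x22ca6c
Power Cycles: 0x8
Power On Hours: 0x34
Media Errors: 0x0
"""

if __name__ == "__main__":
    fields = parse_kv(SAMPLE)
    print("Power-on hours:", hex_field(fields, "Power On Hours"))
    print("Composite temperature (C):", kelvin_to_celsius(fields["Composite Temperature"]))
    written = data_units_to_bytes(hex_field(fields, "Data Units Written"))
    print("Bytes written:", written)
```

For the output above this gives 52 power-on hours and a composite temperature of about 45.85 °C.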

[#:~] esxcli nvme device log error get -e 1 -A vmhba1
Error Info:
Error Count: 0x0
Submission Queue ID: 0
Command ID: 0
Status Field: 0
Byte in Command That Contained the Error: 0
Bit in Command That Contained the Error: 0
LBA: 0x0
Namespace: 0
Vendor Specific Information Available: 0

[#:~] esxcli storage core device stats get -d t10.NVMe____Samsung_SSD_980_PRO_1TB_________________A0D5B021B3382500
Device: t10.NVMe____Samsung_SSD_980_PRO_1TB_________________A0D5B021B3382500
Successful Commands: 2528426
Blocks Read: 37344027
Blocks Written: 161580797
Read Operations: 422227
Write Operations: 2095673
Reserve Operations: 692
Reservation Conflicts: 0
Failed Commands: 1
Failed Blocks Read: 0
Failed Blocks Written: 0
Failed Read Operations: 0
Failed Write Operations: 0
Failed Reserve Operations: 0