Solid State Disk Wear Status Change alerts on DL380 G9

1. Solid State Disk Wear Status Change alerts on DL380 G9

1 Recommend
markotsg80
Posted Feb 11, 2018 08:23 PM

We have DL380 G9 servers running ESXi 6.0 with vSAN 6.2.
We use HP SIM for monitoring, and SIM seems to be giving us a lot of predictive failure alerts (mainly for SSD drives), including alerts like the one below:
Solid State Disk Wear Status Change
This is contradictory and these seem to be false alerts, as ESXi shows these disks as healthy, iLO shows them as healthy, and even SIM shows them as healthy (Health status).

Has anyone come across this before?
We use the latest firmware and P440 controllers in HBA mode.
2. RE: Solid State Disk Wear Status Change alerts on DL380 G9

1 Recommend
TheBobkin
Posted Feb 11, 2018 09:32 PM

Hello markotsg80,
While vSAN/ESXi may be capable of keeping these drives mounted and in use (and thus they appear 'healthy'), blocks do wear out over time and after extended use are more prone to failure or corruption. SSDs typically rotate block usage so that blocks wear more evenly, and manufacturers reserve a percentage of space to replace failed blocks.
What % wear are your SSDs down to?
Whether to replace them at 25-30% or at 10% really depends on the criticality of the data and uptime, though I would strongly advise against allowing multiple cache-tier devices to get below these levels, as the risk of double-failure or failure during rebuild may be increased.
You can check from ESXi via SMART if you are concerned this is a false positive (but then again, this likely gathers these stats from the same source).
# esxcli storage core device list
# esxcli storage core device smart get -d <device>
https://kb.vmware.com/s/article/2040405
This can also be generated in graph form for all devices on a host via log bundle collection:
# vm-support -w <directoryForStoringBundle>
# tar -xvf esxiName_support_bundle.tgz
# less ExtractedBundleName/commands/smartinfo.txt
Bob
3. RE: Solid State Disk Wear Status Change alerts on DL380 G9

0 Recommend
markotsg80
Posted Feb 11, 2018 09:52 PM

Many thanks, will try this.
From memory, when I ran esxcli storage core device smart get -d <device>, most of the details were marked as N/A, apart from the status, which was OK.
I will run it again and attach a screenshot.
I will also check the smartinfo.txt file from the bundle to see if it gives any more useful details.
4. RE: Solid State Disk Wear Status Change alerts on DL380 G9

0 Recommend
markotsg80
Posted Feb 11, 2018 10:20 PM

Hello Bob,
This is what I get when running the esxcli storage core device smart get -d <device> command:
Device: naa.5000cca01d2afaa8
Parameter                     Value Threshold Worst
-----------------------------------------------------
Health Status                 N/A    N/A        N/A
Media Wearout Indicator       N/A    N/A        N/A
Write Error Count             0      N/A        N/A
Read Error Count              0      N/A        N/A
Power-on Hours                N/A    N/A        N/A
Power Cycle Count             N/A    N/A        N/A
Reallocated Sector Count      N/A    N/A        N/A
Raw Read Error Rate           N/A    N/A        N/A
Drive Temperature             32     N/A        N/A
Driver Rated Max Temperature N/A    N/A        N/A
Write Sectors TOT Count       N/A    N/A        N/A
Read Sectors TOT Count        N/A    N/A        N/A
Initial Bad Block Count       N/A    N/A        N/A
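To see at a glance which counters the drive is not reporting, the table above can be parsed with a few lines of Python. This is only a minimal sketch: it assumes the column layout shown in this post (parameter name followed by three whitespace-separated columns), and the names `parse_smart_table` and `unreported` are hypothetical helpers, not part of esxcli.

```python
# Output pasted from the post above (esxcli storage core device smart get).
SAMPLE = """\
Device: naa.5000cca01d2afaa8
Parameter                     Value Threshold Worst
-----------------------------------------------------
Health Status                 N/A    N/A        N/A
Media Wearout Indicator       N/A    N/A        N/A
Write Error Count             0      N/A        N/A
Read Error Count              0      N/A        N/A
Power-on Hours                N/A    N/A        N/A
Power Cycle Count             N/A    N/A        N/A
Reallocated Sector Count      N/A    N/A        N/A
Raw Read Error Rate           N/A    N/A        N/A
Drive Temperature             32     N/A        N/A
Driver Rated Max Temperature N/A    N/A        N/A
Write Sectors TOT Count       N/A    N/A        N/A
Read Sectors TOT Count        N/A    N/A        N/A
Initial Bad Block Count       N/A    N/A        N/A
"""


def parse_smart_table(text):
    """Return {parameter: {"value", "threshold", "worst"}} for each data row."""
    rows = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, the device line, the header and the separator.
        if not line or line.startswith(("-", "Device:", "Parameter")):
            continue
        # Parameter names contain spaces, so split the last three columns
        # off from the right-hand side.
        parts = line.rsplit(None, 3)
        if len(parts) != 4:
            continue
        name, value, threshold, worst = parts
        rows[name] = {"value": value, "threshold": threshold, "worst": worst}
    return rows


def unreported(rows):
    """List the parameters whose Value column is N/A."""
    return [name for name, row in rows.items() if row["value"] == "N/A"]


rows = parse_smart_table(SAMPLE)
for name in unreported(rows):
    print("not reported:", name)
```

For this drive only Write Error Count, Read Error Count and Drive Temperature carry values, so the wear level cannot be read from this output.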
5. RE: Solid State Disk Wear Status Change alerts on DL380 G9

0 Recommend
TheBobkin
Posted Feb 12, 2018 12:25 AM

Hello markotsg80,
It's possible you don't have a module installed that allows these 'N/A' parameters to be read. Are you also positive that these devices are being presented as pass-through as opposed to RAID 0? (Pass-through is supported with the correct FW on the P440ar.)
Either way, as I said previously, the alerts will be getting this information from the hardware sensors/counters, so you can go by these if you want to know the current wear level of your devices.
Bob
6. RE: Solid State Disk Wear Status Change alerts on DL380 G9

0 Recommend
markotsg80
Posted Feb 12, 2018 10:41 AM
| view attached

This is what I get when running the SMART capture.
The controller is definitely in HBA mode.
7. RE: Solid State Disk Wear Status Change alerts on DL380 G9

0 Recommend
markotsg80
Posted Feb 12, 2018 11:26 AM

---------------------------- ----- --------- -----
Health Status                 OK     N/A        N/A
Media Wearout Indicator       N/A    N/A        N/A
Write Error Count             0      N/A        N/A
Read Error Count              0      N/A        N/A
Power-on Hours                N/A    N/A        N/A
Power Cycle Count             40     N/A        N/A
Reallocated Sector Count      N/A    N/A        N/A
Raw Read Error Rate           N/A    N/A        N/A
Drive Temperature             23     N/A        N/A
Driver Rated Max Temperature N/A    N/A        N/A
Write Sectors TOT Count       N/A    N/A        N/A
Read Sectors TOT Count        N/A    N/A        N/A
Initial Bad Block Count       N/A    N/A        N/A
[root@sv230530:/bin]
8. RE: Solid State Disk Wear Status Change alerts on DL380 G9

0 Recommend
TheBobkin
Posted Feb 12, 2018 08:34 PM

Hello markotsg80
"Solid State Disk Wear Status Change"
Are you positive that this isn't just updating every time there is a change to the % remaining or something similar?
How often are you getting these alerts, and are they always on the same drives or do they vary?
"
Usage remaining: 99.34%
Power On Hours: 3309
Estimated Life Remaining based on workload to date: 20752 days
"
This drive appears to have almost no wear on it and is okay.
"
Device: naa.5000cca01d2afaa8
Parameter                     Value Threshold Worst
-----------------------------------------------------
Health Status                 N/A    N/A        N/A
"
"
Health Status                 OK     N/A        N/A
"
Were these on different drives (e.g. one capacity-tier, one cache-tier)? It is strange that Health Status is reported on one but not the other. As I was saying, you may need a utility other than SIM to see these values. HPE have a few AFAIK, so maybe see what is available for download (HPE SSA for a start).
Bob
9. RE: Solid State Disk Wear Status Change alerts on DL380 G9

0 Recommend
markotsg80
Posted Feb 13, 2018 11:54 AM

I just checked on the server where we received the predictive HP SIM alert. When you run the esxcli SMART command, the wear % is not shown, just the health status.
We can only see the number of days left and the % usage remaining for SSD drives, not for MD drives.
Interestingly, 99.34% usage remaining is listed as 20752 days:
Usage remaining: 99.34%
Power On Hours: 3309
Estimated Life Remaining based on workload to date: 20752 days
On another host, an SSD shows:
Usage remaining: 99%
Power On Hours: 3309
Estimated Life Remaining based on workload to date: 9000 days
For a 0.34% difference in usage remaining, why is there such a large difference in the number of days?
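One possible explanation, sketched below, is that the estimate is a simple linear extrapolation from the wear consumed so far. This is an assumption, not documented HPE behaviour, and `estimated_days_remaining` is a hypothetical helper. It does reproduce the 20752-day figure from 99.34% remaining and 3309 power-on hours, and it shows why the result is so sensitive: the wear used so far is the divisor, so going from 0.66% used to 1% used shortens the estimate by about a third.

```python
HOURS_PER_DAY = 24


def estimated_days_remaining(usage_remaining_pct, power_on_hours):
    """Linearly extrapolate remaining life from the wear consumed to date.

    Assumption: the drive keeps wearing at the average rate seen so far.
    """
    used_pct = 100.0 - usage_remaining_pct
    if used_pct <= 0:
        raise ValueError("no measurable wear yet; cannot extrapolate")
    hours_per_pct = power_on_hours / used_pct  # hours taken per 1% of wear
    return usage_remaining_pct * hours_per_pct / HOURS_PER_DAY


days_a = estimated_days_remaining(99.34, 3309)
days_b = estimated_days_remaining(99.00, 3309)
print(int(days_a))  # 20752
print(int(days_b))  # 13649
```

Under the same assumption, 9000 days at 3309 power-on hours would correspond to roughly 98.5% usage remaining, so the 99% figure quoted above may simply be rounded.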
