VMware Cloud Community
markotsg80
Enthusiast

Solid State Disk Wear Status Change alerts on DL380 G9

We have DL380 G9 servers running ESXi 6.0 with vSAN 6.2.

We use HPE SIM for monitoring, and SIM seems to be giving us a lot of predictive failure alerts (mainly for SSD drives), including alerts like the one below:

Solid State Disk Wear Status Change

This is contradictory and these seem to be false alerts, as ESXi shows these disks as healthy, iLO shows them as healthy, and even SIM itself shows them as healthy (Health status).

Has anyone come across this before?

We use the latest firmware and P440 controllers in HBA mode.

8 Replies
TheBobkin
Champion

Hello markotsg80,

While vSAN/ESXi may be capable of keeping these drives mounted and in use (and thus appearing 'healthy'), blocks do wear out over time and become more prone to failure/corruption after extended use. SSDs typically cycle block usage so that blocks wear evenly, and manufacturers reserve a percentage of spare space to replace failed blocks.

What % wear are your SSDs down to?

Whether you consider replacing them at 25-30% or at 10% really depends on the criticality of the data/uptime, though I would strongly advise against letting multiple cache-tier devices drop below these levels, as that increases the risk of a double failure or a failure during rebuild.

You can check from ESXi via 'smart' if you are concerned this is a false positive (though this likely gathers its stats from the same source):

# esxcli storage core device list

# esxcli storage core device smart get -d <device>

https://kb.vmware.com/s/article/2040405

This can also be generated in graph form for all devices on a host via log bundle collection:

# vm-support -w <directoryForStoringBundle>

# tar -xvf esxiName_support_bundle.tgz

# less ExtractedBundleName/commands/smartinfo.txt

Bob

markotsg80
Enthusiast

Many thanks

Will try this

From memory, when I ran `esxcli storage core device smart get -d <device>`, most of the details were marked as N/A, apart from the status being OK.

Will run it again and attach a screenshot.

Will also check the smartinfo.txt file from the bundle to see if it gives any more useful details.

markotsg80
Enthusiast

Hello Bob

This is what I get when running the `esxcli storage core device smart get -d <device>` command:

Device:  naa.5000cca01d2afaa8

Parameter                     Value  Threshold  Worst
----------------------------  -----  ---------  -----
Health Status                 N/A    N/A        N/A
Media Wearout Indicator       N/A    N/A        N/A
Write Error Count             0      N/A        N/A
Read Error Count              0      N/A        N/A
Power-on Hours                N/A    N/A        N/A
Power Cycle Count             N/A    N/A        N/A
Reallocated Sector Count      N/A    N/A        N/A
Raw Read Error Rate           N/A    N/A        N/A
Drive Temperature             32     N/A        N/A
Driver Rated Max Temperature  N/A    N/A        N/A
Write Sectors TOT Count       N/A    N/A        N/A
Read Sectors TOT Count        N/A    N/A        N/A
Initial Bad Block Count       N/A    N/A        N/A

TheBobkin
Champion

Hello markotsg80,

It's possible you don't have some module installed that allows checking these 'N/A' parameters. Also, are you positive that these devices are configured as pass-through as opposed to R0? (Pass-through is supported with the correct firmware on the P440ar.)

Either way, the information being reported comes from the hardware sensors/counters, as I said previously, so you can go by these if you want to know the current wear level of your devices.

Bob

markotsg80
Enthusiast

This is what I get when running the SMART capture.

The controller is definitely in HBA mode.

markotsg80
Enthusiast

----------------------------  -----  ---------  -----
Health Status                 OK     N/A        N/A
Media Wearout Indicator       N/A    N/A        N/A
Write Error Count             0      N/A        N/A
Read Error Count              0      N/A        N/A
Power-on Hours                N/A    N/A        N/A
Power Cycle Count             40     N/A        N/A
Reallocated Sector Count      N/A    N/A        N/A
Raw Read Error Rate           N/A    N/A        N/A
Drive Temperature             23     N/A        N/A
Driver Rated Max Temperature  N/A    N/A        N/A
Write Sectors TOT Count       N/A    N/A        N/A
Read Sectors TOT Count        N/A    N/A        N/A
Initial Bad Block Count       N/A    N/A        N/A

[root@sv230530:/bin]

TheBobkin
Champion

Hello markotsg80,

"Solid State Disk Wear Status Change"

Are you positive that this alert isn't simply fired every time there is a change to the % remaining, or something similar?

How often are you getting these alerts, and are they always on the same drives or do they vary?

"

Usage remaining: 99.34%

Power On Hours: 3309

Estimated Life Remaining based on workload to date: 20752 days

"

This drive appears to have almost no wear on it and is okay.

"

Device:  naa.5000cca01d2afaa8

Parameter                     Value  Threshold  Worst

-----------------------------------------------------

Health Status                 N/A    N/A        N/A

"

"

Health Status                 OK     N/A        N/A

"

Were these on different drives? (e.g. one capacity-tier, one cache-tier) It's strange that it can see a Health Status value on one but not the other. As I was saying, you may need some utility other than SIM to see these; HPE has a few AFAIK, so maybe see what is available for download (HPE SSA for a start).

Bob

markotsg80
Enthusiast

Just checked on the server where we received the predictive HP SIM alert: when you run the esxcli smart command, the wear % is not shown, just the health status.

We can only see the number of days left and % usage remaining for SSD drives, not for the MD drives.

Interestingly, 20752 days is listed against a usage remaining of 99.34%:

Usage remaining: 99.34%

Power On Hours: 3309

Estimated Life Remaining based on workload to date: 20752 days

On another host's SSD:

Usage remaining: 99%

Power On Hours: 3309

Estimated Life Remaining based on workload to date: 9000 days

For a 0.34% difference in usage remaining, why such a large difference in the number of days?
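One plausible explanation (an assumption about the firmware, not documented behaviour) is that the estimate is a simple linear extrapolation: wear consumed so far over the power-on time, projected forward. Because the consumed percentage sits in the denominator, tiny differences in it swing the day count enormously. A sketch of that arithmetic:

```python
def estimated_days_remaining(usage_remaining_pct, power_on_hours):
    # Assumed model: the drive keeps wearing at the same rate as the
    # workload to date, so remaining life scales with remaining / consumed.
    consumed_pct = 100.0 - usage_remaining_pct
    elapsed_days = power_on_hours / 24.0
    return elapsed_days * usage_remaining_pct / consumed_pct

# First drive from the thread: 99.34% remaining after 3309 power-on hours.
print(round(estimated_days_remaining(99.34, 3309)))  # 20752
```

This reproduces the first drive's 20752 days exactly. The second drive's rounded "99%" would give about 13650 days under the same formula, not 9000, which suggests the displayed percentage is rounded from a slightly lower true value; a shift of well under 1% in wear consumed is enough to move the estimate by thousands of days.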
