vSAN1

  • 1.  Solid State Disk Wear Status Change alerts on DL380 G9

    Posted Feb 11, 2018 08:23 PM

    We have DL380 G9 servers running ESXi 6.0 with vSAN 6.2.

    We use HP SIM for monitoring, and SIM seems to be giving us a lot of predictive failure alerts (mainly for SSD drives), including alerts like the one below:

    Solid State Disk Wear Status Change

    This is contradictory and these seem to be false alerts, as ESXi shows these disks as healthy, iLO shows them as healthy, and even SIM shows them as healthy (Health status).

    Has anyone come across this before?

    We use the latest firmware and P440 controllers in HBA mode.



  • 2.  RE: Solid State Disk Wear Status Change alerts on DL380 G9

    Posted Feb 11, 2018 09:32 PM

    Hello markotsg80,

    While vSAN/ESXi may be capable of keeping these drives mounted and in use (and thus they appear 'healthy'), blocks do wear out over time and after extended use are more prone to failure/corruption. SSDs typically rotate block usage so that blocks wear more evenly, and manufacturers reserve a percentage of space to replace failed blocks.

    What % wear are your SSDs down to?

    Whether you consider replacing them at 25-30% or at 10% really depends on the criticality of the data/uptime, though I would strongly advise against allowing multiple cache-tier devices to get below these levels, as the risk of double failure or failure during rebuild may be increased.

    You can check from ESXi via 'smart' if you are concerned this is a false positive (but then again this likely gathers these stats from the same source).

    # esxcli storage core device list

    # esxcli storage core device smart get -d <device>

    https://kb.vmware.com/s/article/2040405

    This can also be generated in graph form for all devices on a host via log bundle collection:

    # vm-support -w <directoryForStoringBundle>

    # tar -xvf esxiName_support_bundle.tgz

    # less ExtractedBundleName/commands/smartinfo.txt
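
To avoid running the SMART query by hand for each disk, the two esxcli commands above can be combined in a small loop. This is only a minimal sketch: it assumes the device list prints each device identifier unindented on its own line, followed by indented properties that include "Is SSD: true" or "Is SSD: false". The function name ssd_ids is made up for illustration.

```shell
# Print the identifiers of devices that report "Is SSD: true".
# Reads the output of 'esxcli storage core device list' on stdin.
# Assumption: each device block starts with the unindented device
# identifier, followed by indented "Key: value" property lines.
ssd_ids() {
    awk '
        /^[^ \t]/ { id = $1 }                      # unindented line: new device block
        /^[ \t]+Is SSD: true[ \t]*$/ { print id }  # indented property of that block
    '
}

# Usage on the host (shown as a comment, not executed here):
#   esxcli storage core device list | ssd_ids | while read -r dev; do
#       esxcli storage core device smart get -d "$dev"
#   done
```

If the device list on your build is laid out differently, the two patterns would need adjusting.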

    Bob



  • 3.  RE: Solid State Disk Wear Status Change alerts on DL380 G9

    Posted Feb 11, 2018 09:52 PM

    Many thanks

    Will try this

    From memory, when I ran esxcli storage core device smart get -d <device>, most of the details were marked as N/A, apart from a status of OK.

    Will run it again and will attach the screenshot.

    Will check the smartinfo.txt file from the bundle as well to see if it gives any more useful details.



  • 4.  RE: Solid State Disk Wear Status Change alerts on DL380 G9

    Posted Feb 11, 2018 10:20 PM

    Hello Bob

    This is what I get when running the esxcli storage core device smart get -d <device> command:

    Device:  naa.5000cca01d2afaa8

    Parameter                     Value  Threshold  Worst
    -----------------------------------------------------
    Health Status                 N/A    N/A        N/A
    Media Wearout Indicator       N/A    N/A        N/A
    Write Error Count             0      N/A        N/A
    Read Error Count              0      N/A        N/A
    Power-on Hours                N/A    N/A        N/A
    Power Cycle Count             N/A    N/A        N/A
    Reallocated Sector Count      N/A    N/A        N/A
    Raw Read Error Rate           N/A    N/A        N/A
    Drive Temperature             32     N/A        N/A
    Driver Rated Max Temperature  N/A    N/A        N/A
    Write Sectors TOT Count       N/A    N/A        N/A
    Read Sectors TOT Count        N/A    N/A        N/A
    Initial Bad Block Count       N/A    N/A        N/A
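
When most rows come back as N/A, it can be easier to pull out a single counter than to read the whole table. The following is a minimal sketch, assuming the columns are separated by two or more spaces as in the output quoted here. smart_value is a hypothetical helper name, not part of esxcli.

```shell
# Print the Value column for one parameter of a SMART table.
# Reads 'esxcli storage core device smart get' output on stdin.
# Assumption: columns are separated by two or more spaces, as in
# the output quoted in this thread.
smart_value() {
    awk -v want="$1" -F '  +' '
        { sub(/^[ \t]+/, "") }          # drop any leading indentation
        $1 == want { print $2; exit }   # first column is the parameter name
    '
}

# Usage on the host (shown as a comment, not executed here):
#   esxcli storage core device smart get -d <device> | smart_value 'Media Wearout Indicator'
```

A result of N/A means the value is not being reported through this path. It does not by itself say anything about the wear level of the drive.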



  • 5.  RE: Solid State Disk Wear Status Change alerts on DL380 G9

    Posted Feb 12, 2018 12:25 AM

    Hello markotsg80,

    It's possible you don't have a module installed that allows these 'N/A' parameters to be checked, though are you also positive that these devices are being presented as pass-through as opposed to RAID 0? (supported with the correct FW on the P440ar)

    Either way, the alerting is going to be getting these values from the hardware sensors/counters as I said previously, so you can go by these if you want to know the current wear level of your devices.

    Bob



  • 6.  RE: Solid State Disk Wear Status Change alerts on DL380 G9

    Posted Feb 12, 2018 10:41 AM

    This is what I get when running the smart capture.

    The controller is definitely in HBA mode.



  • 7.  RE: Solid State Disk Wear Status Change alerts on DL380 G9

    Posted Feb 12, 2018 11:26 AM

    ----------------------------  -----  ---------  -----
    Health Status                 OK     N/A        N/A
    Media Wearout Indicator       N/A    N/A        N/A
    Write Error Count             0      N/A        N/A
    Read Error Count              0      N/A        N/A
    Power-on Hours                N/A    N/A        N/A
    Power Cycle Count             40     N/A        N/A
    Reallocated Sector Count      N/A    N/A        N/A
    Raw Read Error Rate           N/A    N/A        N/A
    Drive Temperature             23     N/A        N/A
    Driver Rated Max Temperature  N/A    N/A        N/A
    Write Sectors TOT Count       N/A    N/A        N/A
    Read Sectors TOT Count        N/A    N/A        N/A
    Initial Bad Block Count       N/A    N/A        N/A

    [root@sv230530:/bin]



  • 8.  RE: Solid State Disk Wear Status Change alerts on DL380 G9

    Posted Feb 12, 2018 08:34 PM

    Hello markotsg80

    "Solid State Disk Wear Status Change"

    Are you positive that this alert isn't simply firing every time the % remaining changes, or something similar?

    How often are you getting these alerts, and are they always on the same drives or on varying ones?

    "

    Usage remaining: 99.34%

    Power On Hours: 3309

    Estimated Life Remaining based on workload to date: 20752 days

    "

    This drive appears to have almost no wear on it and is okay.

    "

    Device:  naa.5000cca01d2afaa8

    Parameter                     Value  Threshold  Worst

    -----------------------------------------------------

    Health Status                 N/A    N/A        N/A

    "

    "

    Health Status                 OK     N/A        N/A

    "

    Were these on different drives (e.g. one capacity-tier, one cache-tier)? It is strange that it can see a Health Status value on one but not the other. As I was saying, you may need a utility other than SIM to see these values; HPE has a few AFAIK, so maybe see what is available for download (HPE SSA for a start).

    Bob



  • 9.  RE: Solid State Disk Wear Status Change alerts on DL380 G9

    Posted Feb 13, 2018 11:54 AM

    Just checked on the server where we received the predictive HP SIM alert: when you run the esxcli smart command, the wear % is not shown, just the health status.

    We can only see the number of days left and the % usage remaining for SSD drives, not for MD drives.

    Interestingly, the drive with 99.34% usage remaining is listed with 20752 days of estimated life:

    Usage remaining: 99.34%

    Power On Hours: 3309

    Estimated Life Remaining based on workload to date: 20752 days

    On an SSD in another host:

    Usage remaining: 99%

    Power On Hours: 3309

    Estimated Life Remaining based on workload to date: 9000 days

    For a 0.34% difference in usage remaining, why is there such a large difference in the number of days?
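
One possible explanation is that the estimate is a straight-line extrapolation: days powered on, multiplied by the % remaining, divided by the % already used. That formula is an assumption on my part, not something taken from HPE documentation, and the helper name estimate_days is made up, but it does reproduce the 20752 figure from the numbers quoted above. Because the result is divided by the small "used" percentage, a fraction of a percent of extra wear changes the estimate a lot.

```shell
# Estimate remaining life in days by linear extrapolation:
#   days_left = days_powered_on * remaining% / used%
# Assumption: this formula is a guess that happens to fit the first
# drive quoted in this thread; it is not a documented HPE method.
# Arguments: power-on hours, usage remaining in percent.
estimate_days() {
    awk -v hours="$1" -v remaining="$2" 'BEGIN {
        used = 100 - remaining
        if (used <= 0) { print "n/a"; exit }   # no measurable wear yet
        printf "%.0f\n", (hours / 24) * remaining / used
    }'
}

estimate_days 3309 99.34   # first drive quoted above
estimate_days 3309 99      # second drive, if 99% were an exact figure
```

The first call prints 20752, matching the reported estimate. The second prints 13650 rather than 9000. Under the same assumption, 9000 days would correspond to roughly 98.5% remaining, so the 99% shown for that drive may simply be a rounded value.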