vSAN1


Physical Disk: how to monitor.

  • 1.  Physical Disk: how to monitor.

    Posted Mar 31, 2014 09:12 AM

    Hi,

    I'm using three Dell servers, each with the following, per the HCL:

    - perc h200 hba

    - 200GB SSD SAS Mixed Used  ( OEM Pliant/Sandisk Lightning LB206M)

    Now I have a disk group in unhealthy status, caused by an SSD in "permanent disk failure". I understand that by replacing the disk I'm going to lose the disk group.

    My question is how to monitor the disk status so as to get advance warning and, if needed, how to test a device for failure.

    Thanks for any suggestion,

    Giosuè



  • 2.  RE: Physical Disk: how to monitor.

    Posted Apr 01, 2014 10:22 PM

    Giosuè

    The SSD and disks should be "SMART" capable, and since you are using the Dell H200 they are in "Passthrough" mode (the only mode supported for VSAN on that particular controller), so we should see the characteristics of those disks. In theory, those drives should alert us when a SMART threshold has been reached, and this should bubble up through the Hardware Status tab of the ESXi host.

    I haven't seen a drive do this yet, as I have not seen a SMART failure being tripped.

    There may be another way to pull the information manually. I will check this and come back on the thread tomorrow.

    Simon



  • 3.  RE: Physical Disk: how to monitor.

    Posted Apr 02, 2014 10:12 AM

    Hi,

    thanks for your reply.

    Yes the disks are in "pass-through" mode, but the hardware sensor displays no data.

    I'm using an LSI CIM provider, as described in "Monitoring Dell PowerEdge RAID Controllers in VMware ESXi".

    Cluster Disk management view: (screenshot not preserved)

    Host hardware status view: (screenshot not preserved)

    After a reboot the disk is OK for a while, but once it is used the failure reappears.

    Thanks in advance,

    Giosuè



  • 4.  RE: Physical Disk: how to monitor.

    Posted Apr 02, 2014 10:34 AM

    Giosuè

    If you have a VSAN license, I would recommend opening a support request with us. It may well come down to the hardware not giving us the information, but our engineering team would certainly be interested in this as well.

    Regards

    Simon



  • 5.  RE: Physical Disk: how to monitor.

    Posted Apr 02, 2014 12:17 PM

    Sorry, I'm evaluating Virtual SAN for a VDI project; I don't have a licence for now.

    Thanks

    Ciao

    Giosuè



  • 6.  RE: Physical Disk: how to monitor.

    Posted Apr 02, 2014 12:32 PM

    Hi Giosuè

    Going by the model of the SSD, I would say it is not one that is certified for VSAN usage: the LB406M by SanDisk has been certified, but not the LB206M. So it may well be that the SSD does not have the information it needs to provide to us.

    Regards

    Simon



  • 7.  RE: Physical Disk: how to monitor.

    Posted Apr 02, 2014 01:32 PM

    Hi Simon,

    I was very interested in your statement above about SMART and the ability of ESXi to use it.

    BUT you also support RAID0 mode on the Dell H710P RAID controller; it is on your HCL for VSAN. Now, this controller will NOT let ESXi access the SMART data. So my question is: is the H710P really fully supported in RAID0 mode, or are you saying something like "errm, yep, we support it, but we'd like you to use pass-through-capable controllers if you can"?

    Best regards,

    Joerg



  • 8.  RE: Physical Disk: how to monitor.

    Posted Apr 02, 2014 01:38 PM

    Hi,

    thanks for your collaboration

    The SSD should be a

    DELL 200GB SSD SAS Mixed Used/Value 6Gbps 2.5in, SAS, ESXi 5.5 U1

    The original manufacturer is SanDisk

    Best regards,

    Giosuè



  • 9.  RE: Physical Disk: how to monitor.

    Posted Apr 02, 2014 01:58 PM

    GiosuePacifico

    Now that is interesting. The reason I say this is that the SanDisk LB406M is the 400GB version and the LB206M is the 200GB version of the same family; however, SanDisk never certified the LB206M, while Dell certified their own version. If you run vm-support on that particular host, inside the tgz file you will find a commands folder, and in there a smartinfo.sh.txt file. Paste the contents of that file for the SSD; it should look something like the following:

    Device:  t10.ATA_____ST31500341AS________________________________________9VS0HGWZ

    Parameter                     Value  Threshold  Worst

    -----------------------------------------------------

    Health Status                 OK     N/A        N/A

    Media Wearout Indicator       N/A    N/A        N/A

    Write Error Count             N/A    N/A        N/A

    Read Error Count              117    6          99

    Power-on Hours                64     0          64

    Power Cycle Count             100    20         100

    Reallocated Sector Count      99     36         99

    Raw Read Error Rate           117    6          99

    Drive Temperature             38     0          51

    Driver Rated Max Temperature  62     45         49

    Write Sectors TOT Count       200    0          200

    Read Sectors TOT Count        N/A    N/A        N/A

    Initial Bad Block Count       99     99         99
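For anyone who wants to compare several disks, a table like the one above can be turned into structured data. The following is a minimal Python sketch, not a VMware tool: the names `parse_smart_table` and `to_field` are made up for illustration, and it assumes the column layout shown in this thread (attribute name, then Value, Threshold and Worst, with at least two spaces after the name).

```python
import re

# A few rows copied from the example output above.
SAMPLE = """\
Parameter                     Value  Threshold  Worst
-----------------------------------------------------
Health Status                 OK     N/A        N/A
Media Wearout Indicator       N/A    N/A        N/A
Read Error Count              117    6          99
Drive Temperature             38     0          51
"""

# Attribute name (may contain single spaces), then three whitespace-separated fields.
ROW = re.compile(
    r"^(?P<name>.+?)\s{2,}(?P<value>\S+)\s+(?P<threshold>\S+)\s+(?P<worst>\S+)\s*$"
)


def to_field(token):
    """Map 'N/A' to None, digits to int, and keep anything else (e.g. 'OK') as text."""
    if token == "N/A":
        return None
    return int(token) if token.isdigit() else token


def parse_smart_table(text):
    """Return {attribute name: {'value': ..., 'threshold': ..., 'worst': ...}}."""
    rows = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, the header row and the dashed separator.
        if not line or line.startswith("Parameter") or set(line) <= set("- "):
            continue
        match = ROW.match(line)
        if match:
            rows[match.group("name")] = {
                key: to_field(match.group(key))
                for key in ("value", "threshold", "worst")
            }
    return rows


if __name__ == "__main__":
    for name, fields in parse_smart_table(SAMPLE).items():
        print(name, fields)
```

Lines that do not fit the four-column shape, such as the `Device:` line, are simply ignored by this sketch.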

    joergriether Yes, you can use that controller with Virtual SAN in RAID 0, and yes, unfortunately we will not see the underlying characteristics of the disks. In that case we'd be relying on the RAID controller to pass up the predictive failure information.

    Simon



  • 10.  RE: Physical Disk: how to monitor.

    Posted Apr 02, 2014 02:33 PM

    Here it is

    SMART Information for disks.

    Device:  naa.5001e8200272f504

    Parameter                     Value  Threshold  Worst

    -----------------------------------------------------

    Health Status                 N/A    N/A        N/A

    Media Wearout Indicator       N/A    N/A        N/A

    Write Error Count             0      N/A        N/A

    Read Error Count              0      N/A        N/A

    Power-on Hours                N/A    N/A        N/A

    Power Cycle Count             N/A    N/A        N/A

    Reallocated Sector Count      N/A    N/A        N/A

    Raw Read Error Rate           N/A    N/A        N/A

    Drive Temperature             34     N/A        N/A

    Driver Rated Max Temperature  N/A    N/A        N/A

    Write Sectors TOT Count       N/A    N/A        N/A

    Read Sectors TOT Count        N/A    N/A        N/A

    Initial Bad Block Count       N/A    N/A        N/A

    OK, no SMART at all ;-)

    Giosuè



  • 11.  RE: Physical Disk: how to monitor.
    Best Answer

    Posted Apr 02, 2014 02:54 PM

    Giosuè

    As I suspected, your SSD does not report any SMART characteristics apart from the current temperature and the read/write error counts, and it reports no thresholds. This means one of two things:

    1. The drive does not support these enhanced SMART features

    2. The implementation of the features is not the standard method of SMART integration

    So it would be a question for Dell to take to their Product Engineering, and for them to engage with SanDisk. I still think it is strange that SanDisk has not certified that model but Dell has.

    So with this drive, it would have to fail before VSAN knows anything is wrong. At least now you know.
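The point about missing thresholds can be illustrated with a small check. This is only a sketch in Python under stated assumptions: the helper names `monitorable` and `can_warn_early` are hypothetical, the rows are a subset copied by hand from the two outputs posted in this thread, and "monitorable" is taken here to mean simply that an attribute reports both a numeric value and a numeric threshold.

```python
# (attribute, value, threshold) copied from the example output in post 9.
REFERENCE_DISK = [
    ("Health Status", "OK", None),
    ("Read Error Count", 117, 6),
    ("Reallocated Sector Count", 99, 36),
    ("Drive Temperature", 38, 0),
]

# The same attributes as reported for naa.5001e8200272f504 in post 10.
PLIANT_SSD = [
    ("Health Status", None, None),
    ("Read Error Count", 0, None),
    ("Reallocated Sector Count", None, None),
    ("Drive Temperature", 34, None),
]


def monitorable(rows):
    """Names of attributes that report both a numeric value and a numeric threshold."""
    return [
        name
        for name, value, threshold in rows
        if isinstance(value, int) and isinstance(threshold, int)
    ]


def can_warn_early(rows):
    """True if at least one attribute could be compared against a threshold."""
    return bool(monitorable(rows))


if __name__ == "__main__":
    print("reference disk:", monitorable(REFERENCE_DISK))
    print("Pliant SSD:", monitorable(PLIANT_SSD))
```

With no attribute that carries a threshold, there is nothing for a threshold-based alert to compare against, which is why this drive can only be noticed once it has already failed.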

    Simon



  • 12.  RE: Physical Disk: how to monitor.

    Posted Apr 02, 2014 03:20 PM

    Ok, I was suspecting something like this.

    I will swap the controller with a full LSI 92xx and check again.

    Thanks

    ciao

    Giosuè



  • 13.  RE: Physical Disk: how to monitor.

    Posted Apr 02, 2014 06:18 PM

    Giosuè

    Just remember, it might not be down to the controller, it might actually be down to the SSD

    Simon



  • 14.  RE: Physical Disk: how to monitor.

    Posted Apr 02, 2014 07:02 PM

    Hi,

    I'm well aware. But I'm wondering about the other disks, the 1.2 TB HDDs: their SMART status is N/A too.

    So I'm (wishfully) thinking the culprit is the controller.

    Next week I should receive the LSI controller and I'll come back.

    Thanks again

    Giosuè

    Device:  naa.5000cca01d2afaa8

    Parameter                     Value  Threshold  Worst

    -----------------------------------------------------

    Health Status                 N/A    N/A        N/A

    Media Wearout Indicator       N/A    N/A        N/A

    Write Error Count             0      N/A        N/A

    Read Error Count              0      N/A        N/A

    Power-on Hours                N/A    N/A        N/A

    Power Cycle Count             N/A    N/A        N/A

    Reallocated Sector Count      N/A    N/A        N/A

    Raw Read Error Rate           N/A    N/A        N/A

    Drive Temperature             32     N/A        N/A

    Driver Rated Max Temperature  N/A    N/A        N/A

    Write Sectors TOT Count       N/A    N/A        N/A

    Read Sectors TOT Count        N/A    N/A        N/A

    Initial Bad Block Count       N/A    N/A        N/A



  • 15.  RE: Physical Disk: how to monitor.

    Posted Apr 14, 2014 11:17 AM

    Hi,

    I was wrong. The controller is innocent ;-)

    I replaced the PERC H200 with an LSI SAS9211-8i. It uses the same SAS2008 chipset, but the LSI has a PCIe x8 connector.

    I upgraded the firmware and the driver as per the HCL:

    # esxcfg-scsidevs -a

    vmhba0  mpt2sas           link-n/a  sas.500605b006081990                    (0:8:0.0) LSI Logic / Symbios Logic LSI2008

    # vmkload_mod -s mpt2sas |grep Version

    Version: Version 18.00.00.00.1vmw, Build: 472560, Interface: 9.2 Built on: Nov 21 2013

    # /opt/lsi/bin/sas2flash -list

    LSI Corporation SAS2 Flash Utility

    Version 18.00.0.00 (2013.11.18)

    Copyright (c) 2008-2013 LSI Corporation. All rights reserved

            Adapter Selected is a LSI SAS: SAS2008(B2)

            Controller Number              : 0

            Controller                     : SAS2008(B2)

            PCI Address                    : 00:08:00:00

            SAS Address                    : 500605b-0-0608-1990

            NVDATA Version (Default)       : 11.00.00.08

            NVDATA Version (Persistent)    : 11.00.00.08

            Firmware Product ID            : 0x2713 (IR)

            Firmware Version               : 18.00.00.00

            NVDATA Vendor                  : LSI

            NVDATA Product ID              : SAS9211-8i

            BIOS Version                   : 07.35.00.00

            UEFI BSD Version               : 07.22.01.00

            FCODE Version                  : N/A

            Board Name                     : SAS9211-8i

            Board Assembly                 : H3-25250-02F

            Board Tracer Number            : SP30360316

            Finished Processing Commands Successfully.

            Exiting SAS2Flash.


    Then I checked the S.M.A.R.T. data:


    # esxcli storage core device list | grep Display

       Display Name: Pliant Serial Attached SCSI Disk (naa.5001e8200272f504) 

    VMware KB: ESXi S.M.A.R.T. health monitoring for hard drives

    # esxcli storage core device smart get -d naa.5001e8200272f504

    Parameter                     Value  Threshold  Worst

    ----------------------------  -----  ---------  -----

    Health Status                 N/A    N/A        N/A

    Media Wearout Indicator       N/A    N/A        N/A

    Write Error Count             0      N/A        N/A

    Read Error Count              0      N/A        N/A

    Power-on Hours                N/A    N/A        N/A

    Power Cycle Count             N/A    N/A        N/A

    Reallocated Sector Count      N/A    N/A        N/A

    Raw Read Error Rate           N/A    N/A        N/A

    Drive Temperature             34     N/A        N/A

    Driver Rated Max Temperature  N/A    N/A        N/A

    Write Sectors TOT Count       N/A    N/A        N/A

    Read Sectors TOT Count        N/A    N/A        N/A

    Initial Bad Block Count       N/A    N/A        N/A

    Still no SMART information.
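The manual check above (list the devices, then query SMART for one of them) could be repeated for every disk with a short script. The following Python sketch is only an illustration: the function names are hypothetical, it uses just the two esxcli commands already shown in this post, it picks the device ID out of the `Display Name` line in the format seen above, and it assumes a Python 3 interpreter is available wherever esxcli can be run, which may not hold on every ESXi build.

```python
import re
import subprocess

# Matches e.g. "Display Name: Pliant Serial Attached SCSI Disk (naa.5001e8200272f504)"
DISPLAY_NAME = re.compile(r"Display Name:.*\((?P<device>[^()\s]+)\)\s*$")


def device_ids(listing):
    """Extract device IDs from 'esxcli storage core device list' output."""
    ids = []
    for line in listing.splitlines():
        match = DISPLAY_NAME.search(line)
        if match:
            ids.append(match.group("device"))
    return ids


def smart_command(device):
    """Build the SMART query used earlier in this post for one device."""
    return ["esxcli", "storage", "core", "device", "smart", "get", "-d", device]


def collect_smart(run=subprocess.check_output):
    """Return {device ID: raw SMART table text} for every listed device.

    'run' is injectable so the logic can be exercised without an ESXi host.
    """
    listing = run(["esxcli", "storage", "core", "device", "list"]).decode()
    return {dev: run(smart_command(dev)).decode() for dev in device_ids(listing)}


if __name__ == "__main__":
    for device, table in collect_smart().items():
        print("Device:", device)
        print(table)
```

Collecting the raw tables per device makes it easy to see at a glance whether every disk behind the controller reports N/A, or only one model.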

    I'll ask Dell for a replacement.

    Thanks

    Giosuè