1 Reply Latest reply on Jan 13, 2018 6:24 AM by TheBobkin

    vSAN - reclaim disks with data on them

    catalystjmf Lurker

      The servers are HP DL380 Gen8s with P420i controllers, and all drives are configured as single-drive RAID 0 logical volumes.

      VMware 6.5

       

      We have a fairly complex situation, but it started with 3 drives failing across 2 vSAN hosts at the same time. It was hard to know exactly what was going on because vSphere was very unresponsive and would not display good data or info. One server would not boot, and at some point the RAID 0 logical drives for the failed drives were removed from the RAID controller config. The server eventually booted up.

      We sent the drives off to Ontrack and they did some magic. We got the drives back, and they kept a copy of the data. We put the drives back into the server, and the RAID controller could see them again, so that was good. We booted the server and ESXi sees the drives. That's good.

      vSAN, however, wants nothing to do with the drives, and we are definitely not going to add them back in, as that would likely reformat them. So how on earth do we tell vSAN that all those inaccessible objects are on those drives? Thanks for any ideas!

        • 1. Re: vSAN - reclaim disks with data on them
          TheBobkin Master
          vExpert, VMware Employee

          Hello catalystjmf

 

          "it started out with 3 drives failing in 2 vsan hosts at the same time."

          Capacity-tier or cache-tier devices?

           

          "It was hard to know exactly what was going"

          vmkernel.log, vobd.log and vmkwarning.log from the hosts these drives were on should provide information as to the cause and nature of the failures.

           

          "One server would not boot"

          Did it 'stall' for a significant period of time at 'SSD Initialization', and if so, how long did you wait? (This can take hours if disk-groups are in a bad state.)

          Or was a drive that the host's boot partition is located on affected?

           

          "at some point the raid 0 logical drives for the failed drives was removed from the raid controller config."

          By whom and how?

           

          "We sent the drives off to ontrack and they did some magic."

          Was Kroll made aware that these drives were part of a vSAN cluster? What guidance or information did they provide about the viability of recovering this data and how to integrate the drives back into their disk-groups?

           

          "We put the drives back into the server"

          Did Kroll manage to repair the original drives or did they clone the data off onto new drives?

           

          "the raid controller could see them again so that was good"

          Were you somehow able to add the drives back to their original individual RAID0 volumes?

           

          "VSAN however wants nothing to do with the drives and we are definitely not going to add them back in as that would likely reformat them."

          vSAN can't consume new disks that have existing partitions on them, and ESXi/vSAN won't reformat added drives unless you tell it to.

          Are the partitions on the drives detected as intact vSAN partitions?

          This can be checked using partedUtil getptbl or via the Web Client:

          Host > Manage > Storage Adapters > select the drive > partitions are listed in the lower pane.
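
          For reference, an intact vSAN capacity disk typically shows a GPT label with a small 'vsan' metadata partition followed by a large 'virsto' partition. A rough sketch of what that looks like from the ESXi shell (the device name, geometry and sector numbers below are made up for illustration; only the overall shape matters):

          # partedUtil getptbl /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxx
          gpt
          121601 255 63 1953525168
          1 2048 6143 381CFCCC728811E092EE000C2911D0B2 vsan 0
          2 6144 1953525134 77719A0CA4A011E3A47E000C29745A24 virsto 0

          If instead you see 'unknown' partitions, no label, or an error reading the table, that tells us the on-disk metadata did not survive intact.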

           

          "So how on earth do we tell vsan that all those inaccessible objects are on those drives?"

          It *may* be possible to add the disks back to their existing disk-groups, but this depends on a number of factors, some of which I have asked for clarification on above.

          It may also be feasible to repair some or all of the inaccessible Objects, but this depends on which components they lost and their FTM (Fault Tolerance Method; e.g. RAID1 = may be feasible, RAID5/6 = no).

           

          I strongly advise you to open an SR with the VMware GSS vSAN team if you have not done so already; it probably would have been best to do this before pulling the drives out.

           

          Can you attach a dump of the current cmmds output?

          I may be able to determine the viability of recovery options from looking at the current state of the Objects.

          From any host, run this to write the output to the text file specified:

          # cmmds-tool find -f json > /tmp/cmmds_dump_output.txt
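
          If it helps while you wait on GSS, here is a minimal sketch (not an official tool) for getting a quick offline summary of such a dump with Python. It assumes the dump is a JSON document with a top-level "entries" list whose items carry a "type" field; the exact schema can differ between vSAN builds, so treat this as a starting point rather than a definitive parser:

          ```python
          #!/usr/bin/env python3
          # Sketch: summarise a cmmds-tool JSON dump by entry type.
          # Assumes the dump looks like {"entries": [{"type": ..., "uuid": ..., ...}, ...]};
          # field names are an assumption and may vary between vSAN builds.
          import json
          from collections import Counter

          def summarize_cmmds(path):
              """Return a Counter of CMMDS entry types found in the dump file."""
              with open(path) as f:
                  dump = json.load(f)
              # Entries with no "type" field are bucketed under "UNKNOWN".
              return Counter(entry.get("type", "UNKNOWN")
                             for entry in dump.get("entries", []))

          if __name__ == "__main__":
              import sys
              for etype, count in summarize_cmmds(sys.argv[1]).most_common():
                  print(f"{etype}: {count}")
          ```

          Running it against the dump file gives a rough count of DOM_OBJECT, DISK, etc. entries, which can hint at how much of the cluster metadata is still present, though actually judging repairability from the dump is a job for GSS.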

           

          You can also check this yourself in the Web Client:

          Cluster > Monitor > Virtual disks > select the inaccessible Objects and look at which components are still healthy; if an Object still has an accessible and complete RAID0 set of data-components, then it *should* be repairable by GSS.

 

          Bob