4 Replies Latest reply on Dec 18, 2017 8:32 PM by hkg2581

    Operation time out when removing disk from vSan

    tinnh Lurker

      I had an SSD cache disk fail. I tried to remove it so I could deal with it, but the operation gets stuck and times out. I tried both options, Full data migration and Ensure accessibility, and neither worked.

      An event appeared before the timeout event.

      How can I resolve this issue? I would also like to know whether it is completely safe to physically replace the failed SSD.

      Capture.JPG

        • 1. Re: Operation time out when removing disk from vSan
          TheBobkin Master
          VMware Employees, vExpert

          Hello tinnh,

           

           

          The first thing to do in this situation is to verify the data health, i.e. that all Objects that had components residing on the failed Disk-Group have been rebuilt on the remaining nodes/DGs in the cluster (provided there is an adequate number of available Fault Domains and enough space).

          This can be verified via the Web Client - Cluster > Monitor > Health > Data

          or via the CLI on the host using cmmds-tool e.g. this prints the number of Objects with each Config Status (state 7 = Healthy):

          #cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6 | grep 'uuid\|content' | grep -o 'state\\\":\ [0-9]*' | sort | uniq -c
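
          As a rough sketch of what the last three stages of that pipeline do, the same tally can be written in Python. It assumes the input text contains fragments of the form state\": 7, which is what the grep pattern above implies; the function names and the sample text are hypothetical and not part of cmmds-tool.

```python
import re
from collections import Counter

# Mirrors the grep -o pattern in the command above: the literal text
# state\": followed by a number (assumed shape of the escaped content field).
STATE_RE = re.compile(r'state\\":\s*(\d+)')


def count_states(text):
    """Tally how many times each CONFIG_STATUS state number appears in text."""
    return Counter(int(num) for num in STATE_RE.findall(text))


def unhealthy(counts, healthy_state=7):
    """Return only the states that differ from the healthy one (7 per the post)."""
    return {state: n for state, n in counts.items() if state != healthy_state}


# Hypothetical sample lines for illustration only, not real cluster output.
# The value 15 is an arbitrary non-7 state chosen for the demo.
SAMPLE = "\n".join([
    r'"content": "{\"state\": 7}"',
    r'"content": "{\"state\": 7}"',
    r'"content": "{\"state\": 15}"',
])

if __name__ == "__main__":
    counts = count_states(SAMPLE)
    print("per-state counts:", dict(counts))
    print("not healthy:", unhealthy(counts))
```

          The text passed in should already be limited to CONFIG_STATUS entries (the first grep stages of the command), since this sketch counts every state match it sees. Any state other than 7 in the result means some Objects are not yet healthy.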

           

          If all data has resynced and is healthy then it *should* be safe to remove the disk-group via alternative methods: either delete the disk-group with 'No Action', or wipe the partitions on the drives via the Web Client (Host > Configure > Storage Devices > Select Device > All Actions > Erase Partitions). NOTE: CAREFULLY check that the correct drives are being worked on here, as this is PERMANENT.

          If neither of the above is possible (hostd can hold locks on badly failed disks) then the remaining option would be to boot the host with the vSAN modules disabled and then wipe the partitions.

          If this is not a lab-cluster and if possible, do open a support request with VMware GSS and/or proceed with caution here.

           

          Hope this helps.

           

           

          Bob

          • 2. Re: Operation time out when removing disk from vSan
            tinnh Lurker

            Hi Bob,

             

            Thanks for your advice. I followed your instructions to check data health, and it shows the following:

            Capture2.JPG

            I tried Repair Objects Immediately and then checked the resync status, but it is empty even after a refresh. No resynchronization seems to be happening.

            Capture3.JPG

            Do you have any idea?

             

            Regards!

            • 3. Re: Operation time out when removing disk from vSan
              TheBobkin Master
              vExpert, VMware Employees

              Hello tinnh,

               

               

              The data may be unable to resync because there is insufficient space on the appropriate fault domains now that this disk-group has issues.

              How many nodes and disk-groups are in this cluster?

              What is the space-utilisation per disk as per RVC? (vsan.disks_stats <pathToCluster>)
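
              The space question can be framed as rough arithmetic once the per-disk-group numbers are known. The sketch below is only a necessary check, not a sufficient one: it ignores fault-domain placement rules, which can rule out otherwise free disk-groups. The DiskGroup type, the host names, the figures, and the max_fill threshold are all hypothetical and are not taken from RVC output.

```python
from dataclasses import dataclass


@dataclass
class DiskGroup:
    host: str            # hypothetical host name
    capacity_gb: float   # total capacity of the disk-group
    used_gb: float       # space currently consumed
    failed: bool = False


def can_absorb(groups, max_fill=0.8):
    """Compare data on failed disk-groups with headroom on healthy ones.

    max_fill is an illustrative ceiling on how full a healthy disk-group
    may become. Returns (fits, gb_to_move, gb_headroom).
    """
    to_move = sum(g.used_gb for g in groups if g.failed)
    headroom = sum(
        max(0.0, g.capacity_gb * max_fill - g.used_gb)
        for g in groups
        if not g.failed
    )
    return to_move <= headroom, to_move, headroom


if __name__ == "__main__":
    # Made-up figures for a three-node cluster with one disk-group per host.
    cluster = [
        DiskGroup("esx1", 1000, 600, failed=True),
        DiskGroup("esx2", 1000, 700),
        DiskGroup("esx3", 1000, 750),
    ]
    fits, to_move, headroom = can_absorb(cluster)
    print(f"fits={fits} to_move={to_move} GB headroom={headroom} GB")
```

              If the check fails, a resync cannot complete regardless of placement. If it passes, the resync may still be blocked by a lack of eligible fault domains, which is why the node and disk-group counts matter as well.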

               

              If no VMs/data is currently inaccessible then it is likely safe to remove this disk-group (via partition-wipe) and rebuild the disk-group - double-check that this disk-group is properly failed using esxtop 'u' and verify that there are 0 IOs to all devices in this disk-group.

               

              As I said previously - if you do have the ability to open an SR with VMware GSS please do this as someone can check this better live via WebEx than I can advise without seeing the cluster live.

               

               

              Bob

              • 4. Re: Operation time out when removing disk from vSan
                hkg2581 Novice
                VMware Employees

                tinnh

                 

                If this is a production cluster, please raise a support ticket with VMware so that a TSE can review it. Please refrain from deleting a disk group with no data migration, as you may cause data loss. I see that you have multiple objects with reduced availability and non-compliance.