3 Replies Latest reply on Dec 20, 2018 11:23 PM by alexanderleutz

    local 2 node cluster  vsan 6.7u1 power outage test failed (fault domain problem?)

    alexanderleutz Novice

      scenario:

      2 physical vSAN 6.7 U1 nodes + 1 witness VM (6.7 U1)

      each physical host is in a separate fire section

      all connected via 10 Gb LAN switches on the same campus

      witness and VCSA run on a separate, non-vSAN cluster host in a 3rd fire section

       

      We have done a lot of failure testing: removed disks, disconnected network, failed hosts, failed power.

      vSAN rebuilt automatically every time, with no hands-on intervention.

      Perfect.

       

      BUT:

      When all hosts are shut down and we power on only the secondary fault domain host and the witness, the VMs don't restart. All VMs are disconnected or inaccessible.

      There is no way to get the VMs back online.

      why?

       

      If we power on the preferred fault domain host and the witness instead, the VMs recover and restart automatically.

       

      The customer needs the same "automatic repair" regardless of which data host is alive.

       

      Is this a bug or a configuration (or an understanding) error?

       

      best regards

      Alexander

        • 1. Re: local 2 node cluster  vsan 6.7u1 power outage test failed (fault domain problem?)
          sk84 Expert
          vExpert

          What's your FTT setting for the vms that are shown as disconnected or inaccessible?

          • 2. Re: local 2 node cluster  vsan 6.7u1 power outage test failed (fault domain problem?)
            TheBobkin Virtuoso
            VMware Employees
            vExpert

            Hello Alexander,

             

             

            "when all hosts are shut down, we power on secondary fault domain host and witness the vms didnt restart. all vms are disconnected or inaccessible.

            there ist no way to get the vms back online.

            why?"

            How did you shut down the hosts and were the VMs on the cluster powered-on when this was performed? (e.g. hard drop/MM with No Action etc.)

            If the VMs were on, are you positive that the node on the Secondary site didn't go down first, thus making the data on this node technically stale (as it missed the writes the Primary site committed)?

            This could be very easily clarified by checking the state of the data under these conditions, e.g. with just the Witness and Secondary available, all/most of the DOM-Objects would likely have a Config-Status of 28 or some other non-7/15 state, indicating stale data. This can be checked using:

            # cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6 | grep 'uuid\|content' | grep 'state\\\":' | sort | uniq -c
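            If you capture the output of that command to a file, the state values can also be tallied offline. The following is only a minimal sketch: it assumes the captured text contains entries of the form `\"state\": 7` (escaped quotes, as the grep pattern above implies) or `"state": 7`, and the function name `tally_states` is made up for illustration.

```python
import re
from collections import Counter

# Matches `state": 7` with or without a backslash before the quote.
# The exact output layout is an assumption based on the grep pattern above.
STATE_RE = re.compile(r'state\\?":\s*(\d+)')


def tally_states(text):
    """Count how often each numeric state value appears in the captured text."""
    return Counter(int(value) for value in STATE_RE.findall(text))


# Hypothetical sample lines, not real cluster output.
sample = (
    '"content": "{\\"state\\": 7, \\"CSN\\": 3}"\n'
    '"content": "{\\"state\\": 28, \\"CSN\\": 5}"\n'
    '"content": "{\\"state\\": 7, \\"CSN\\": 4}"\n'
)

for state, count in sorted(tally_states(sample).items()):
    print(f"{count:6d}  state {state}")
```

            Feed it only the CONFIG_STATUS entries (the first `grep` stage above), since other entry types may carry their own `state` fields. A large count of anything other than 7 or 15 would point towards the stale-data explanation.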

             

            If the data was cold when this occurred, then it is likely a case of Preferred site selection. If the data is not stale and you can reproduce this, try changing the Preferred Fault Domain to the Secondary site and test whether you can power on VMs.

            I am assuming you have no site affinities or 'Must-run' rules applied here; do check this if you are not sure.
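
            To illustrate why the shutdown order matters, here is a deliberately simplified toy model of the stale-data case described above. It is not how vSAN is implemented: the `Component` class, the sequence numbers and the equal-weight quorum rule are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Component:
    name: str
    is_data: bool   # False for the witness, which holds metadata only
    last_seq: int   # last write sequence number this component recorded
    online: bool = False


def object_accessible(components: List[Component]) -> bool:
    """Toy rule: need a majority online AND one up-to-date data replica."""
    online = [c for c in components if c.online]
    if 2 * len(online) <= len(components):
        return False  # no quorum
    newest = max(c.last_seq for c in online)
    # A data replica that is behind the newest known sequence is stale.
    return any(c.is_data and c.last_seq == newest for c in online)


def scenario(secondary_seq: int, powered_on: Set[str]) -> bool:
    # Preferred host and witness both saw writes up to sequence 12.
    comps = [
        Component("preferred", True, 12),
        Component("secondary", True, secondary_seq),
        Component("witness", False, 12),
    ]
    for c in comps:
        c.online = c.name in powered_on
    return object_accessible(comps)


# Secondary went down first (missed writes 11 and 12), then comes back alone.
print(scenario(10, {"secondary", "witness"}))  # False
# Same history, but the preferred host comes back instead.
print(scenario(10, {"preferred", "witness"}))  # True
# Data was cold: the secondary missed nothing.
print(scenario(12, {"secondary", "witness"}))  # True
```

            In this sketch the witness is what reveals that the secondary copy is behind, which matches the symptom of inaccessible objects with only Witness and Secondary available.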

             

             

            Bob

            • 3. Re: local 2 node cluster  vsan 6.7u1 power outage test failed (fault domain problem?)
              alexanderleutz Novice

              Thank you all, and merry Christmas!

               

              After running the failover tests one at a time, with patience and a delay between them, everything works fine (and automatically).

               

              Best regards,

              Alexander