scenario:
2 physical vSAN 6.7 U1 nodes + 1 witness VM 6.7 U1
each physical host in a separate fire zone
all connected via 10 Gb LAN switches on the same campus
witness and VCSA running on another, non-vSAN cluster host in a 3rd fire zone
Now we do a lot of failure testing: removing disks, removing the network, failing hosts, cutting power.
vSAN automatically rebuilds every time without needing hands on.
Perfect.
BUT:
When all hosts are shut down and we power on the secondary fault domain host and the witness, the VMs don't restart. All VMs are disconnected or inaccessible.
There is no way to get the VMs back online.
Why?
If we power on the preferred fault domain host and the witness, the VMs repair automatically and restart automatically.
The customer needs the same "automatic repair" regardless of which data host is alive.
Is this a bug, or a configuration (or an understanding) error?
best regards
Alexander
Hello Alexander,
"When all hosts are shut down and we power on the secondary fault domain host and the witness, the VMs don't restart. All VMs are disconnected or inaccessible.
There is no way to get the VMs back online.
Why?"
How did you shut down the hosts, and were the VMs on the cluster powered on when this was performed? (e.g. hard drop, Maintenance Mode with "No Action", etc.)
If the VMs were on, are you positive that the node on the Secondary site didn't go down first, thus making the data on that node technically stale (as it missed the writes the Primary site committed)?
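The staleness point can be illustrated conceptually (a simplified model for illustration only, not vSAN internals; the sequence numbers and function below are hypothetical): each data copy effectively tracks how far it got in the write sequence, and a copy that went down before the last writes is behind, so bringing back only that copy cannot safely serve the object:

```python
# Conceptual sketch of why the copy that went down FIRST is "stale".
# Not vSAN code; names and CSN values here are illustrative only.

def replica_is_stale(replica_csn: int, last_committed_csn: int) -> bool:
    """A copy is stale if it missed writes committed after it went down."""
    return replica_csn < last_committed_csn

# Secondary host drops while the object is at commit 100; the Primary
# keeps accepting writes and commits up to 120 before it too shuts down.
secondary_csn, last_committed_csn = 100, 120

# Powering on only Secondary + Witness: the sole surviving data copy is
# stale, so the object stays inaccessible rather than silently losing writes.
print(replica_is_stale(secondary_csn, last_committed_csn))  # True

# Powering on Primary + Witness: the data copy is current, so VMs restart.
print(replica_is_stale(120, last_committed_csn))  # False
```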
This could be clarified very easily by checking the state of the data under these conditions: e.g. with just the Witness and Secondary available, all or most of the DOM objects would likely have a Config-Status of 28 or some other non-7/15 state, indicating stale data. This can be checked using:
# cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6 | grep 'uuid\|content' | grep 'state\\\":' | sort | uniq -c
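For a quicker read of that output, the counting stage of the pipeline can be sketched in Python (a rough sketch; the sample lines below are hypothetical and much shorter than real `cmmds-tool find -f python` output, which carries many more fields per entry):

```python
import re
from collections import Counter

def count_config_states(cmmds_output: str) -> Counter:
    """Count how many entries report each CONFIG_STATUS 'state' value.

    Expects text in the style of `cmmds-tool find -f python`, where the
    CONFIG_STATUS content is an escaped JSON string containing a field
    like {\"state\": 7, ...}. States 7/15 are healthy; other values
    (e.g. 28) indicate stale or degraded objects.
    """
    states = re.findall(r'\\"state\\":\s*(\d+)', cmmds_output)
    return Counter(int(s) for s in states)

# Hypothetical sample resembling two healthy objects and one stale one:
sample = r'''
"content": "{\"state\": 7, \"csn\": 101}"
"content": "{\"state\": 7, \"csn\": 102}"
"content": "{\"state\": 28, \"csn\": 99}"
'''
print(count_config_states(sample))  # Counter({7: 2, 28: 1})
```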
If the data was cold when this occurred, then it is likely a case of Preferred site selection. If the data is not stale and you can reproduce this, try changing the Preferred Fault Domain to the Secondary site and test whether you can power on the VMs.
I am assuming you have no site affinities or "Must run" rules applied here; do check this if you are not sure.
Bob
What's your FTT setting for the VMs that are shown as disconnected or inaccessible?
Thank you all, and merry Christmas!
After running only one failover test at a time, with patience and a delay between them, everything works fine (and automatically).
Best regards,
Alexander