VMware Cloud Community
alexanderleutz
Contributor

Local 2-node cluster vSAN 6.7 U1 power outage test failed (fault domain problem?)

Scenario:

2 physical vSAN 6.7 U1 nodes + 1 witness VM (6.7 U1)

Each physical host is in a separate fire section.

All are connected via 10 Gb LAN switches on the same campus.

The witness and the VCSA run on another, non-vSAN cluster host in a third fire section.

We did a lot of failure testing: removing disks, removing the network, failing hosts, cutting power.

vSAN rebuilt automatically every time, without any hands-on intervention.

Perfect.

BUT:

When all hosts are shut down and we then power on only the secondary fault domain host and the witness, the VMs do not restart. All VMs are disconnected or inaccessible.

There is no way to get the VMs back online.

Why?

If we power on the preferred fault domain host and the witness, the VMs repair automatically and restart automatically.

The customer needs the same "automatic repair" regardless of which data host is alive.

Is this a bug, or a configuration (or an understanding) error?

Best regards,

Alexander

3 Replies
sk84
Expert

What's your FTT setting for the VMs that are shown as disconnected or inaccessible?

---
Regards, Sebastian
VCP6.5-DCV // VCP7-CMA // vSAN 2017 Specialist
Please mark this answer as 'helpful' or 'correct' if you think your question has been answered correctly.
TheBobkin
Champion
(Accepted Solution)

Hello Alexander,

"when all hosts are shut down, we power on secondary fault domain host and witness the vms didnt restart. all vms are disconnected or inaccessible.

there ist no way to get the vms back online.

why?"

How did you shut down the hosts, and were the VMs on the cluster powered on when this was performed? (e.g. hard power drop, Maintenance Mode with 'No Action', etc.)
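For reference, a shutdown that keeps both data copies in sync would put each host into Maintenance Mode first. A minimal sketch from the ESXi shell, assuming the standard esxcli namespaces on 6.7 (verify the exact options with --help on your build):

# esxcli system maintenanceMode set -e true -m noAction

# esxcli system shutdown poweroff -r "failover test"

The '-m noAction' option tells vSAN not to evacuate or resync any data before the host goes down; the second command then powers the host off cleanly (it requires the host to already be in Maintenance Mode).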

If the VMs were on, are you positive that the node on the Secondary site didn't go down first, thus making the data on this node technically stale? (as it missed the writes the Primary site committed)

This can very easily be clarified by checking the state of the data under these conditions: e.g. with just the Witness and Secondary available, all or most of the DOM objects would likely have a Config-Status of 28, or some other state that is not 7/15, indicating stale data. This can be checked using:

# cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6 | grep 'uuid\|content' | grep 'state\\\":' | sort | uniq -c
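Purely as an illustration of what to look for (the counts and exact formatting will vary by build and object count), healthy output is dominated by state 7, while after an unclean stop of one site you would also see lines like state 28:

     42   \"state\": 7,
      6   \"state\": 28,

Any count next to a state other than 7 or 15 is the number of objects whose config status flags them as not clean/active.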

If the data was cold when this occurred, then it is likely a case of Preferred site selection - if the data is not stale and you can reproduce this, try changing the Preferred Fault Domain to the Secondary site and test whether you can power on the VMs.
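The Preferred Fault Domain can be changed in the vSphere Client under the cluster's Fault Domains settings, or from the ESXi shell. A minimal sketch, assuming the esxcli vsan namespace shipped with 6.7 and using "Secondary" as a placeholder for your actual fault domain name:

# esxcli vsan cluster preferredfaultdomain get

# esxcli vsan cluster preferredfaultdomain set -n "Secondary"

Run the get first to confirm which Fault Domain the cluster currently considers Preferred before changing it.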

I am assuming you have no site affinities or 'Must run' rules applied here; do check this if you are not sure.
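On the vSAN side, a quick sanity check for site affinity can be done from the host shell, assuming the esxcli vsan debug namespace (available since 6.6) and that an affinity rule shows up as a 'locality' attribute in the object policy:

# esxcli vsan debug object list | grep -i locality

No output would suggest no objects are pinned to a single site; DRS 'Must run' VM/Host rules themselves are checked under the cluster's VM/Host Rules settings in the vSphere Client.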

Bob

alexanderleutz
Contributor

Thank you all, and happy Christmas!

After running only one failover test at a time, with patience and a delay between tests, everything works fine (and automatically) :)

Best regards,

Alexander
