Solved: vSAN DR with one Site

mhu1990 · ‎08-01-2022

Hello guys,

We have a brand-new all-flash (AF) vSAN cluster with 5 nodes and RAID 5. It is located in one of our two datacenters (DCs).

There is a "planned" disaster recovery (DR) test at the weekend, and I'm a little bit scared about the vSAN cluster and what will happen to the data.

The scenario is that they power off all access/core switches, so that there is no communication between the ESXi hosts.

My questions:

1) What happens to the VMs on the vSAN cluster? Should I power off the VMs and put the hosts in maintenance mode beforehand, or will the VMs only go into a "read-only" state and usually work again after a restart?

2) What will happen to the shared vSAN datastore? Will it just be inaccessible?

3) When the DR is over, is the resync manual or automatic? Should I do anything?

Did you have any experience with this?

Regards,

Martin

TheBobkin · ‎08-01-2022

@mhu1990

1. They will go down because their backing VMDKs and namespaces become inaccessible. Whether you want to cleanly shut things down and enter maintenance mode (MM) beforehand depends on whether you want to emulate how a real crash behaves or not (I would advise NOT doing so, so that the test is realistic and you can put more accurate provisions/plans in place). Once the data becomes accessible again, HA should kick in (assuming you have it enabled and properly configured) and restart all the VMs, but do check for any that, although unlikely, were not marked as inaccessible and need a manual restart.

2. Do you mean HCI Mesh sharing this vsanDatastore to another cluster? If so, any objects on it will become inaccessible and should react the same way as in 1. If you are referring to the vsanDatastore in general, then from the perspective of this cluster's nodes it should show 0 B capacity with nothing accessible on it (as all the namespaces will be inaccessible).

3. Nope, no manual intervention should be needed.

"Did you have any experience with this?"
Sure, I had the pleasure of working on the vSAN GS EMEA P1 team for many years and have probably seen more recoveries/fixes from network outages than anyone. Fair enough, a lot of network outages are not as clean as off-then-on, so yours will probably be cleaner than how this goes down (and back up) in realistic settings, but all the same it should be a valid test.

mhu1990 · ‎08-02-2022

Hi TheBobkin,

Thanks for the long explanation.

2) No, I meant only the local vSAN datastore, not a mesh infrastructure.

I'm excited to see on Saturday what will really happen with the cluster :).

Regards,

Martin

TheBobkin · ‎08-03-2022

@mhu1990, If you consider that a long explanation you should see my average email summarising a Zoom session! 😂

Just an addendum regarding 3.: you should validate that the cluster fully reforms and is otherwise healthy once the network comes back. This should just be a case of checking Skyline Health (Cluster > Monitor > vSAN > Skyline Health > Retest). If the vCenter managing this cluster is running on this cluster (not ideal), then you will have to wait a few minutes for it to come back up and for all services to start before this is possible.

If that is the case and for any reason vCenter doesn't start and/or there are other issues, then the first place to start would be from SSH on any node in the cluster: 'esxcli vsan cluster get' to validate the full member count (and who is out of the cluster if not all are in) and 'esxcli vsan debug object health summary get' to validate whether all objects are healthy or not.

'esxcli vsan health cluster list' is also a handy node-based fill-in for Skyline Health, but note that this can give false positives for 'network health' in some builds and/or circumstances (certs and cert thumbprints being a common culprit).
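
As a rough sketch of the node-side checks above, something like the following could be run over SSH on one host once the switches are back. The helper names (vsan_member_count, check_vsan_members) are made up for illustration, the expected count of 5 comes from the 5-node cluster in the original post, and the "Sub-Cluster Member Count" label is assumed from what 'esxcli vsan cluster get' typically prints, so compare it against the output of your own build first.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, and the field label
# "Sub-Cluster Member Count" is an assumption about the output format
# of `esxcli vsan cluster get` on your build.

# Read `esxcli vsan cluster get` output on stdin and print the member count.
vsan_member_count() {
    awk -F': *' '/Sub-Cluster Member Count/ { print $2; exit }'
}

# Compare the reported member count with the expected number of nodes.
# Usage: esxcli vsan cluster get | check_vsan_members 5
check_vsan_members() {
    expected="$1"
    actual="$(vsan_member_count)"
    if [ "$actual" = "$expected" ]; then
        echo "OK: $actual of $expected nodes in cluster"
    else
        echo "WARN: ${actual:-unknown} of $expected nodes in cluster"
        return 1
    fi
}

# Only touch the live host if esxcli is actually present.
if command -v esxcli >/dev/null 2>&1; then
    esxcli vsan cluster get | check_vsan_members 5
    # Object health is printed as-is for a manual read-through.
    esxcli vsan debug object health summary get
fi
```

A WARN here only says the node sees fewer members than expected; which host is missing still has to be read from the member list in the full 'esxcli vsan cluster get' output.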

"I'm excited to see on Saturday what will really happen with the cluster"

Crap, just remembered I am scheduled to cover P1 shift this Saturday, can this be rescheduled? (joke)

Do let us know how it goes though, I am sure this is the kind of topic that interests most of the people who frequent this sub-Community.

mhu1990 · ‎08-04-2022

Hi,

No, the vCenter doesn't run on the vSAN cluster itself. It runs on another, traditional cluster (compute/shared storage).

Thanks for the explanation.

I will report next week how it went.

mhu1990 · ‎08-10-2022

Hey,

It worked just as you described. Datastores and capacity were at 0 KB; after powering the switches back on, the capacity came back. I've restarted all VMs and production works again. It was really smooth! Thank you.
