bulabog
Enthusiast

2 node direct VSAN maintenance

Hi All,

Just wondering if someone can provide some helpful advice. We have a 2-node direct-connect vSAN cluster. We want to rebuild (reinstall) one of the ESXi hosts, so we are trying to put that host into maintenance mode and selecting "Full Evacuation" so we can remove it from the vSAN cluster and rebuild it. I understand that during this time we will lose redundancy, since only one host plus the witness remains, but that is fine. However, we are getting the error:

"Failed to enter maintenance mode in the current VSAN data migration mode due to insufficient nodes or disks in the cluster. Retry operation in another mode or after adding more resources to the cluster"

Are we doing this wrong?

Many thanks

6 Replies
daphnissov
Immortal

The Full evacuation option is the problem, as you can't evacuate data from one data node in a 2-node cluster. You'll have to use "Ensure accessibility" instead, which won't attempt to move data to the other node.
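
If you are scripting this, a rough PowerCLI equivalent (host name is an example) would be:

# EnsureAccessibility keeps the VMs running off the mirror on the other node and does not try to evacuate
Get-VMHost -Name "esx01.lab.local" | Set-VMHost -State Maintenance -VsanDataMigrationMode EnsureAccessibility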

bulabog
Enthusiast

Thanks... does this mean we would need to add a (temporary) 3rd host before we can remove this host?

perthorn
Enthusiast

Remember that a 3-node cluster is the absolute minimum to be able to use the storage policy FTT=1 (a 2-node cluster + witness is effectively a 3-node cluster). To be able to completely evacuate one of the hosts while keeping the vSAN objects at FTT=1, you would have to add another host. The alternative is to change the storage policy to FTT=0 for all VMs during the maintenance. This means you would lose availability protection for all objects, but that is what happens during "Ensure accessibility" anyway. Just remember to change the policy back to FTT=1 when you are done.
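
If you go the policy route, a rough PowerCLI sketch would be (the FTT=0 policy name is just an example and would need to be created first):

# Assumes an existing Connect-VIServer session and a pre-created FTT=0 vSAN policy
$ftt0 = Get-SpbmStoragePolicy -Name "FTT0-Maintenance"
$ftt1 = Get-SpbmStoragePolicy -Name "vSAN Default Storage Policy"
# Point the VM homes and their disks at FTT=0 for the maintenance window
Get-VM | Get-SpbmEntityConfiguration | Set-SpbmEntityConfiguration -StoragePolicy $ftt0
Get-VM | Get-HardDisk | Get-SpbmEntityConfiguration | Set-SpbmEntityConfiguration -StoragePolicy $ftt0
# ...do the maintenance, then switch everything back to the FTT=1 policy
Get-VM | Get-SpbmEntityConfiguration | Set-SpbmEntityConfiguration -StoragePolicy $ftt1
Get-VM | Get-HardDisk | Get-SpbmEntityConfiguration | Set-SpbmEntityConfiguration -StoragePolicy $ftt1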

Cheers

Per

TheBobkin
Champion

Hello bulabog,

Check that there is no resync ongoing:

Cluster > Monitor > vSAN > Resyncing components
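
If you have the vSAN PowerCLI cmdlets available, a rough equivalent check (cluster name is an example) is:

# An empty result means no resync is currently in flight
Get-VsanResyncingComponent -Cluster (Get-Cluster -Name "vSAN-Cluster")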

Check that all vSAN Objects (e.g. vmdks) are compliant with their Storage Policy(SP):

Home > Policies and Profiles > VM Storage Policies > Select the SPs in use and check the VMs and disks all show as 'Compliant'
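
A rough PowerCLI equivalent of that compliance check (assuming the SPBM cmdlets are loaded) is:

# Show each VM's assigned policy and whether it is currently compliant
Get-VM | Get-SpbmEntityConfiguration | Select-Object Entity, StoragePolicy, ComplianceStatus
# Check the individual virtual disks too
Get-VM | Get-HardDisk | Get-SpbmEntityConfiguration | Select-Object Entity, StoragePolicy, ComplianceStatus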

In a 2+1 configuration such as yours, compliance for VMs with the Default SP means that there is a data-mirror of each Object residing on each node + Witness components for tie-breaker residing on the Witness - this requires 3 Fault Domains for component placement (node+node+witness).

VM Objects only require access to a single data-mirror (plus the witness component/majority) for the VM to remain accessible, so when you place a data-node in Maintenance Mode with 'Ensure Accessibility' (EA) you are essentially telling the cluster not to use the data on the host in MM and to use the other data-mirror instead. While in this state, VM data is not protected from any further failure (e.g. a capacity disk dying), as there is no redundancy; the Objects are essentially FTT=0 until the other data-node is available to the cluster again and the data has been resynced from the copy that remained active on the surviving node - so make sure you take and verify back-ups before doing this.

I wouldn't advise changing all VMs' SP to FTT=0, as this will likely drop the now-extraneous data-components roughly half from each node - not just from the node you are doing maintenance on - and any data whose only copy then remains on that node will have to be evacuated off it to put the host in MM with EA, which will take a lot longer.

Similarly, if you have any data with an FTT=0 SP applied that is located on this host by choice, it will have to be evacuated off to put the host in MM with EA.

If a host is taking a long time to enter MM with EA, note what % it is at, e.g. 2% is pre-check, ~19% is vMotion of VMs, and after that is data evacuation - you can get more visibility into what exactly it is doing from vmkernel.log and clomd.log.

Do note that, as you are doing a full re-install, you will have to reconfigure the vSAN networking and join the host back to the cluster afterwards:

VMware Knowledge Base
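
A rough PowerCLI sketch of re-joining the rebuilt host (all names, credentials, IPs and device names below are placeholders) would be:

# Add the freshly installed host back into the cluster
Add-VMHost -Name "esx01.lab.local" -Location (Get-Cluster "vSAN-Cluster") -User root -Password 'VMware1!' -Force
$esx = Get-VMHost -Name "esx01.lab.local"
# Re-create the vSAN-tagged VMkernel adapter on the direct-connect link
New-VMHostNetworkAdapter -VMHost $esx -PortGroup "vSAN-Direct" -VirtualSwitch "vSwitch1" -IP "192.168.100.11" -SubnetMask "255.255.255.0" -VsanTrafficEnabled $true
# Re-create the disk group from the local cache and capacity devices
New-VsanDiskGroup -VMHost $esx -SsdCanonicalName "naa.cache_device_id" -DataDiskCanonicalName "naa.capacity_device_id"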

Bob

bulabog
Enthusiast

Hi All,

Thank you for your inputs. Much appreciated. I've managed to get this to work by doing the following (a rough PowerCLI equivalent is sketched after the steps):

1. Put the host into maintenance mode, selecting "Ensure data accessibility..." instead of "Full Evacuate Data".

2. Once in maintenance mode, I then go to the VSAN disk groups and delete the Disk Group for the host I am working on.

3. After waiting for vSAN to do its bit, the cluster is now running with the VMs non-compliant with their storage policy (they are still running and working, just non-compliant).

4. I then exit maintenance mode on the host, then put it in maintenance mode again this time selecting "Full Evacuate Data"

5. Move the host out of the cluster and remove from inventory.
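
For reference, a rough PowerCLI equivalent of that sequence (host, cluster and datacenter names are placeholders) is:

$esx = Get-VMHost -Name "esx01.lab.local"
# Step 1: maintenance mode with Ensure Accessibility
Set-VMHost -VMHost $esx -State Maintenance -VsanDataMigrationMode EnsureAccessibility
# Step 2: delete the host's disk group (nothing is migrated off it)
Get-VsanDiskGroup -VMHost $esx | Remove-VsanDiskGroup -DataMigrationMode NoDataMigration
# Steps 4-5: with no disk group left, move the host out of the cluster and drop it from inventory
Move-VMHost -VMHost $esx -Destination (Get-Datacenter -Name "DC01")
Remove-VMHost -VMHost $esx -Confirm:$false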

I was able to rebuild the host from scratch and add it back to the vSAN cluster. During this time we knew the risk of the vSAN cluster running from one node (+ witness), but that was fine in our case.

I don't know why it couldn't do all of the above automatically when selecting "Full Evacuate Data" the first time, though...

Thank you.

TheBobkin
Champion

Hello bulabog

Sorry but I don't think you are getting how this works.

"2. Once in maintenance mode, I then go to the VSAN disk groups and delete the Disk Group for the host I am working on."

There is no need to remove the disk-groups when re-installing a host.

"4. I then exit maintenance mode on the host, then put it in maintenance mode again this time selecting "Full Evacuate Data" "

The host was already in MM so this changes nothing and is unnecessary.

There was no data on the node to evacuate as there were no disk-groups...

Bob
