drheim
Enthusiast

With 2-node vSAN clusters, why not always choose "Full Data Migration" for maintenance mode?

Jump to solution

Hey guys,

We have a 6.5 U2 cluster (on-disk format 5) that needs to be upgraded to 6.7 U2 (vSAN on-disk format 7). There are only 2 nodes in the cluster, and the upgrade also involves some firmware updates afterward. When putting a 2-node vSAN cluster into maintenance mode, is there any reason not to choose "Full data migration" every time? With more than 2 nodes, I know it might start re-striping components, which may be unnecessary if the server is only going offline for a few minutes; but with only 2 nodes I was thinking it should just be a copy (RAID 1) between the 2 nodes, so why not always choose "Full data migration"? That way you could keep the host down as long as needed without having to worry about the repair delay timer, etc.

The reason I ask is that a few months ago I put one host into maintenance mode and took it offline for an hour or so for hardware maintenance, and the entire environment crashed as soon as I brought it back up. It hit us hard. I never even took the offline server out of maintenance mode, yet the primary node dropped all storage for some reason when it started to resync. I got it back online, but spent hours with VMware Support on this and they never determined why. My best guess is that I chose "Ensure accessibility" and exceeded the default 60-minute repair delay timer, but VMware Support had no ideas after 12 hours of looking through everything.

Thanks,

Dave

1 Solution

Accepted Solutions
TheBobkin
VMware Employee

Hello Dave,

"I was thinking it should just be a copy(raid 1) between the 2 nodes, so why not always mark if as "full data migration"?"

With the default Storage Policy (PFTT=1, FTM=RAID1), vSAN keeps a replica of the data on each data node - it wouldn't be beneficial (nor compliant with the Storage Policy) to move a second copy of the data to the other node, only to have to clone it all back later.

"That way you can shut it down as long as needed without having to worry about the Repair Delay Timer, etc?"

For hosts in MM with Ensure Accessibility, you don't have to worry about the CLOM repair delay in a 1+1+1, a 2+1, or even a standard 3-node cluster: if it expires, where exactly would vSAN rebuild the components to be compliant with the Storage Policy?

It wouldn't, as rebuilding there couldn't increase FTT or compliance, so only a delta resync occurs (i.e. the data the host did not commit while it was out of the loop).
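For reference, the same maintenance-mode choice can be made from the host's shell rather than the UI. A sketch, assuming the esxcli syntax available in the 6.x releases discussed here:

```shell
# Enter maintenance mode, keeping vSAN objects accessible
# (equivalent to choosing "Ensure accessibility" in the UI)
esxcli system maintenanceMode set --enable true --vsanmode ensureObjectAccessibility

# ...perform hardware/firmware maintenance, reboot as needed...

# Exit maintenance mode so the host rejoins the cluster and delta resync runs
esxcli system maintenanceMode set --enable false
```

The other --vsanmode values are evacuateAllData ("Full data migration") and noAction ("No data migration").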

Bob


4 Replies
depping
Leadership

How would you do a full data migration in a 2-node cluster? There is nowhere to migrate the data to.

drheim
Enthusiast

Exactly, but that seemed like the best of the 3 options if you are taking a host offline for a few hours. My guess is that it is probably better to increase the repair delay timer and then choose "Ensure accessibility", but I'm trying to get some feedback on how the options differ in a 2-node cluster.

mcarrillo01
VMware Employee

Hello drheim,

You can look at the following thread for more insight:

2 node direct VSAN maintenance

You can also increase the "repair delay timer" if you're sure the host will be down for more than 60 minutes:

VMware Knowledge Base
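Per that KB, the timer is the per-host advanced setting VSAN.ClomRepairDelay. A sketch of raising it to 120 minutes from the ESXi shell (the value is illustrative; the CLOM daemon must be restarted for the change to take effect, and it should be set identically on every host in the cluster):

```shell
# Show the current repair delay (default is 60 minutes)
esxcli system settings advanced list -o /VSAN/ClomRepairDelay

# Raise the delay to 120 minutes on this host
esxcli system settings advanced set -o /VSAN/ClomRepairDelay -i 120

# Restart the CLOM daemon so the new value takes effect
/etc/init.d/clomd restart
```

Remember to set it back to the default once maintenance is complete.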

Just in case, here you can find the consideration of moving into MM when using a 2 node cluster:

Maintenance Mode Consideration | vSAN 2 Node Guide | VMware

Regards,

Marlon C.
