VMware Cloud Community
anandgopinath
Enthusiast

vSAN stretched cluster - not able to put a data node into maintenance mode

Dear Community,

We have a vSAN stretched cluster as below (VMware ESXi 7.0.3, build 21313628):

2 data nodes (1 in each site) and a witness host.

Most of the VMs have their storage replicated; we use "should" VM/Host rules to spread them across the two sites.

We also have some VMs that are pinned to a specific site with the storage policies below. We use "must" rules for them, and we can afford downtime on them during maintenance activities.

Site disaster tolerance: None - keep data on Secondary (stretched cluster)
Failures to tolerate: No data redundancy

Site disaster tolerance: None - keep data on Preferred (stretched cluster)
Failures to tolerate: No data redundancy

 

When we try to put the data node in either site into maintenance mode with "Ensure accessibility" or "No data migration", the operation fails.

We even tried powering off the VMs that use site-local storage on the impacted data node, but even then the vSAN cluster is not able to migrate the VMs with replicated storage to the other data node.

Is this expected behaviour, or are we doing something wrong?

Appreciate your help & guidance as always  

1 Solution

Accepted Solutions
depping
Leadership


@anandgopinath wrote:

So we have VMs with both a replication policy (storage has 2 copies, one in each fault domain) and a local-site policy (storage is only in 1 fault domain).

For the VMs with the local-site policy (storage is only in 1 fault domain), why should the option "Full data migration" or "Ensure accessibility" not work if the other fault domain has storage capacity?

I don't pin these VMs with "must run" rules anymore.


Because you specified in which fault domain the data needs to reside. If you specify the Preferred site, the data can only move to another host in that fault domain, which you don't have.

View solution in original post

17 Replies
TheBobkin
Champion

@anandgopinath This is expected behaviour and working as intended - how could a cluster satisfy accessibility of objects when you are putting the only node where this data resides into maintenance mode? "No data migration" (No Action) is the only option that will work with your described configuration.
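For reference, the evacuation options in the UI correspond to vSAN decommission modes in the vSphere API. Below is a minimal pyVmomi sketch of entering maintenance mode with an explicit mode; the vCenter address, credentials and host name are placeholders, not values from this thread:

```python
# Minimal pyVmomi sketch (placeholder names/credentials): enter maintenance
# mode with an explicit vSAN decommission mode.
#   "No data migration"     -> objectAction = "noAction"
#   "Ensure accessibility"  -> objectAction = "ensureObjectAccessibility"
#   "Full data migration"   -> objectAction = "evacuateAllData"
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only; validate certificates in production
si = SmartConnect(host="vcenter.example.local",
                  user="administrator@vsphere.local", pwd="***", sslContext=ctx)

# Find the ESXi host object by name.
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
host = next(h for h in view.view if h.name == "esx-site-a.example.local")
view.Destroy()

# Request maintenance mode with the vSAN "No Action" decommission mode.
spec = vim.host.MaintenanceSpec(
    vsanMode=vim.vsan.host.DecommissionMode(objectAction="noAction"))
task = host.EnterMaintenanceMode_Task(timeout=0,
                                      evacuatePoweredOffVms=False,
                                      maintenanceSpec=spec)
print("Task started:", task.info.key)
Disconnect(si)
```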

anandgopinath
Enthusiast

@TheBobkin: Thanks for the quick help.

As mentioned in my post, even with the "No data migration" option we cannot get the host into maintenance mode, as the host cannot migrate a VM with replicated storage and a "should run" rule to the other host.

Not sure what we are doing wrong.

 

 

TheBobkin
Champion

@anandgopinath Are you aware from previous testing whether this particular VM can actually live vMotion? It is not common, but there can be various configurations at the VM level (e.g. a passthrough device) that prevent this from being possible. Is it possible to power off the VM, re-register/cold vMotion it, and power it back on? If yes, then there is probably a configuration on it preventing live vMotion.

 

It can always be other things, like a backup proxy still having the base vmdk attached and locked, etc. - what error message are you getting when trying to vMotion it?
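If it helps, here is a rough pyVmomi sketch that lists VM devices which commonly block live vMotion (PCI/USB passthrough, a CD-ROM backed by a host device). The VM name is a placeholder and `si` is assumed to be an existing ServiceInstance connection:

```python
# Sketch: list devices on a VM that commonly block live vMotion.
# Assumes "si" is an already-connected pyVmomi ServiceInstance.
from pyVmomi import vim

def find_vm(si, name):
    # Walk the inventory with a container view and return the VM by name.
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)
    try:
        return next(v for v in view.view if v.name == name)
    finally:
        view.Destroy()

vm = find_vm(si, "pinned-vm-01")  # placeholder VM name
for dev in vm.config.hardware.device:
    if isinstance(dev, (vim.vm.device.VirtualPCIPassthrough,
                        vim.vm.device.VirtualUSB)):
        print("Potential vMotion blocker:", dev.deviceInfo.label)
    if isinstance(dev, vim.vm.device.VirtualCdrom) and \
       isinstance(dev.backing, vim.vm.device.VirtualCdrom.AtapiBackingInfo):
        print("Host-device CD-ROM attached:", dev.deviceInfo.label)
```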

anandgopinath
Enthusiast

@TheBobkin, there is no issue with VM vMotion etc. When we power off the ESXi host in question, the VM is restarted on the other ESXi host.

It is only when we try to enter maintenance mode that the VM is not moved.

 

 

depping
Leadership

Powering off a host doesn't lead to a vMotion; that is HA taking action.

Try manually migrating the VM from one host to another host to see if that works.
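For example, a minimal pyVmomi sketch along these lines (VM and host names are placeholders, `si` an assumed existing ServiceInstance connection) will surface the actual vMotion error text if the migration fails:

```python
# Sketch: trigger a manual vMotion so the real error message surfaces.
# Assumes "si" is an already-connected pyVmomi ServiceInstance.
from pyVim.task import WaitForTask
from pyVmomi import vim

def get_obj(si, vimtype, name):
    # Look up a managed object of the given type by name.
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    try:
        return next(o for o in view.view if o.name == name)
    finally:
        view.Destroy()

vm = get_obj(si, vim.VirtualMachine, "replicated-vm-01")          # placeholder
target = get_obj(si, vim.HostSystem, "esx-site-b.example.local")  # placeholder

task = vm.MigrateVM_Task(pool=None, host=target,
                         priority=vim.VirtualMachine.MovePriority.defaultPriority)
WaitForTask(task)  # raises with the vMotion error text if the migration fails
```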

depping
Leadership

Just wondering, do those hosts contain the vCLS VMs by any chance? I have seen situations where those are not automatically powered off when going into maintenance mode, which blocks the maintenance mode operation from completing.
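A quick way to check is something like the pyVmomi sketch below (host name is a placeholder, `si` an assumed existing ServiceInstance connection), which lists any vCLS agent VMs still registered on the host:

```python
# Sketch: list vCLS agent VMs still registered on a given host.
# Assumes "si" is an already-connected pyVmomi ServiceInstance.
from pyVmomi import vim

content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
host = next(h for h in view.view if h.name == "esx-site-a.example.local")  # placeholder
view.Destroy()

for vm in host.vm:
    if vm.name.startswith("vCLS"):
        print(vm.name, vm.runtime.powerState)
```

If a vCLS VM turns out to be the blocker, Retreat Mode (the cluster-level config.vcls.clusters.<domain-id>.enabled advanced setting in vCenter, set to false) is typically how those VMs are removed for the duration of the maintenance.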

anandgopinath
Enthusiast

Thanks @depping   for the help  🙂

Very good point, I will check this and revert.

 

 

anandgopinath
Enthusiast

@depping, it seems the issue is with the HA admission control setting below. If HA admission control is disabled, the host can enter maintenance mode.

So does this mean that HA admission control is not compatible with a 2-node stretched cluster?

CPU reserved for failover: 50 %
Memory reserved for failover: 50 %
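For reference, a minimal pyVmomi sketch along these lines (cluster name is a placeholder, `si` an assumed existing ServiceInstance connection) shows where those percentages live in the API and how admission control could be temporarily disabled for the maintenance window and re-enabled afterwards:

```python
# Sketch: inspect the cluster's HA admission control policy and temporarily
# disable admission control. Assumes "si" is an already-connected ServiceInstance.
from pyVim.task import WaitForTask
from pyVmomi import vim

content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "stretched-cluster")  # placeholder
view.Destroy()

policy = cluster.configurationEx.dasConfig.admissionControlPolicy
if isinstance(policy, vim.cluster.FailoverResourcesAdmissionControlPolicy):
    print("CPU % reserved for failover:", policy.cpuFailoverResourcesPercent)
    print("Memory % reserved for failover:", policy.memoryFailoverResourcesPercent)

# Disable admission control for the maintenance window; re-enable afterwards
# by reconfiguring with admissionControlEnabled=True.
spec = vim.cluster.ConfigSpecEx(
    dasConfig=vim.cluster.DasConfigInfo(admissionControlEnabled=False))
WaitForTask(cluster.ReconfigureComputeResource_Task(spec, modify=True))
```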
TheBobkin
Champion

@anandgopinath, if you are reserving one node's worth of compute resources, then it won't allow putting the other node into MM as it cannot satisfy that reservation - so no, you shouldn't have it configured like that.

anandgopinath
Enthusiast

@TheBobkin, thanks for the quick help 🙂

We have the same issue of maintenance mode not working when we choose "Ensure accessibility" as well as "Full data migration".

Same behaviour even if we disable the "must run" rules for the VMs pinned to each site.

The only option which works is "No data migration".

Is this also a limitation of the 2-node stretched cluster?

 

Thanks in advance for your continued support  & guidance  

 

 

TheBobkin
Champion

@anandgopinath, I answered this already - this was literally your initial question.

depping
Leadership


@anandgopinath wrote:

@TheBobkin, thanks for the quick help 🙂

We have the same issue of maintenance mode not working when we choose "Ensure accessibility" as well as "Full data migration".

Same behaviour even if we disable the "must run" rules for the VMs pinned to each site.

The only option which works is "No data migration".

Is this also a limitation of the 2-node stretched cluster?

Thanks in advance for your continued support & guidance


Yes, as you cannot migrate the data anywhere. You should indeed be using "No data migration".

anandgopinath
Enthusiast

@depping  @TheBobkin 

Thanks for the quick response  as always  🙂

I am a bit lost here.

So we have VMs with both a replication policy (storage has 2 copies, one in each fault domain) and a local-site policy (storage is only in 1 fault domain).

For the VMs with the local-site policy (storage is only in 1 fault domain), why should the option "Full data migration" or "Ensure accessibility" not work if the other fault domain has storage capacity?

I don't pin these VMs with "must run" rules anymore.

The same goes for the VMs with the replication policy (storage has 2 copies, one in each fault domain): why should the option "Full data migration" or "Ensure accessibility" not work if the other fault domain has storage capacity?

 

 

depping
Leadership


@anandgopinath wrote:

@depping @TheBobkin

The same goes for the VMs with the replication policy (storage has 2 copies, one in each fault domain): why should the option "Full data migration" or "Ensure accessibility" not work if the other fault domain has storage capacity?


Why would we migrate the data to a host which already holds the data?

depping
Leadership


@anandgopinath wrote:

So we have VMs with both a replication policy (storage has 2 copies, one in each fault domain) and a local-site policy (storage is only in 1 fault domain).

For the VMs with the local-site policy (storage is only in 1 fault domain), why should the option "Full data migration" or "Ensure accessibility" not work if the other fault domain has storage capacity?

I don't pin these VMs with "must run" rules anymore.


Because you specified in which fault domain the data needs to reside. If you specify the Preferred site, the data can only move to another host in that fault domain, which you don't have.

anandgopinath
Enthusiast

@depping: Got it now, thanks. So basically, for these options to work, we need to have more than 1 host per fault domain.

Sorry for all the questions; we have been testing various failure/maintenance scenarios like this, and at times what you read/understood before from the documentation seems lost 🙂

 

Thanks for taking time out from your busy schedule to support the community. Much appreciated.

depping
Leadership

Correct, if you want to move data you would need more hosts.
