VMware Cloud Community
Vel_VMware
Enthusiast

vSAN host not moving to maintenance mode.

Hi,

I have an issue taking an ESXi host that is part of a vSAN cluster into maintenance mode. I chose the "Ensure accessibility from other hosts" data evacuation option, but the task is stalled at 19% and not progressing any further. The VMs are not getting migrated either.

Can anyone please help me understand why this is happening and how I can rectify it? It would also be very helpful if someone could explain what happens in the background in vSAN while a host is taken into maintenance mode.

Version vSAN 6.0

DRS mode - Partially automated.

11 Replies
TheBobkin
Champion

Hello Vel_VMware,

Entering Maintenance Mode with Ensure Accessibility can take some time, as any data components that are resident ONLY on that node will have to be moved. So if you have a lot of FTT=0 VMs, or other FTT=1 Objects that have not been resynced, these will need to move to another host before the host can enter MM.

You can check this from the CLI using the following:

# cmmds-tool find -t NODE_DECOM_STATE -f json      (this will show an entry with 'State: 4' if a node is entering MM and 'State: 6' once that node is in MM)

(This SHOULD also show a list of Objects that require moving before the node enters MM)

# tail -f /var/log/clomd.log (from host that is going into MM)

# tail -f /var/log/clomd.log | grep -i prog  (from host that is going into MM, this will show as Objects are migrated)
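If you also have RVC (Ruby vSphere Console) access on the vCenter Server, the resync dashboard gives a cluster-wide view of any data moves in flight - a rough sketch only, the cluster path below is a placeholder, adjust it to your own inventory:

> vsan.resync_dashboard /localhost/<Datacenter>/computers/<Cluster>      (shows how much data is left to resync per Object)

> vsan.check_state /localhost/<Datacenter>/computers/<Cluster>      (flags inaccessible or out-of-sync Objects)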

Be patient, don't cancel the job unless you have verified there is an issue.

Bob

Vel_VMware
Enthusiast

Thanks a lot Bob. Can you help me get some more clarification on the scenarios below?

I have waited more than 15 hours but there is no progress beyond 19%, even though the host has only 6 VMs.

1) Will objects be recreated if the host does not enter maintenance mode within 60 minutes?

2) What happens in the background if I choose the "Ensure data accessibility" mode?

3) If I manually migrate all the VMs and then put the host into maintenance mode, will it impact the VMs I migrated off that host in any way?

4) Does the Partially automated DRS setting prevent the VMs from migrating while the host is entering maintenance mode?

Thanks in advance.

TheBobkin
Champion

Hello Vel_VMware,

Yes, the Web Client % progress bars can cause concern because they make the process appear stuck - so when in doubt, going by what is occurring in clomd.log and vmkernel.log is the best option.

vMotioning the VMs that are registered on the host is not the long part of this process (assuming some data is being moved/resynced). If you want to test this or rule it out, manually vMotion all the VMs off this host and then, once they have migrated, put the host into MM (EA).

1) Yes, assuming the default clomd repair delay of 60 minutes is configured (on all hosts), the components that resided on the host put in MM will be rebuilt from the remaining replica data components - provided there are enough active nodes available to do this (e.g. you can't resync a 3-component Object with only 2 nodes available). You can confirm the configured delay with the commands sketched after this list.

2) As I said in my last comment - any Objects that do NOT have the majority of their components available on the remaining nodes in the cluster will have to be migrated off this node to 'ensure accessibility'. So if you have some/a lot of FTT=0 VMs, all the data from these will have to be moved off (regardless of whether the VM that this data is backing is or was registered/running on this host). If all your VMs/Objects are using the Default vSAN Storage Policy or any other FTT=1 policy, then no data should be getting migrated (provided it is all healthy and not resyncing).

3) No, unless you put a host in MM with 'No Action' it should not impact the availability of any VMs in the cluster, and even that should only affect FTT=0 Objects/VMs.

4) Test whether DRS migrating VMs off is the issue here (by manually migrating them off first) or whether data is still being moved from when this started (less /var/log/clomd.log | grep -i prog).
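For 1) and 2), a couple of quick host-side checks (a sketch based on 6.x builds - verify the option path and command on your own version):

# esxcfg-advcfg -g /VSAN/ClomRepairDelay      (shows the configured clomd repair delay in minutes, default 60)

# esxcli vsan policy getdefault      (shows the default storage policy, including hostFailuresToTolerate/FTT)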

Any host affinity/anti-affinity rules or reserved failover resources settings that could be an issue?

Bob

TheBobkin
Champion

Hello Vel_VMware,

Some pretty interesting results from testing this in HOL:

The only thing that appears to cause the 19% 'hang' is DRS not being set to 'Fully Automated' - note I have only tested this in 6.5 GA labs (ESXi build 4564106), but I tried a range of factors including resyncs, FTT=0 Objects, FTT=1 Objects in different states, etc.

So to avoid this issue, either manually evacuate the host using vMotion (you can batch-select VMs with Shift+click - but they must be in the same state, so order by state by clicking the column) or set DRS to 'Fully Automated'.
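If you want a quick look at which VMs still have to be moved off the host before that 19% step can pass, you can list them from the host's shell (not vSAN-specific, just a sanity check):

# esxcli vm process list      (lists the powered-on VMs still running on this host)

# vim-cmd vmsvc/getallvms      (lists all VMs registered on this host, including powered-off ones)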

I may check whether this is also present in 6.5 U1; either way, '19%' and any other specific % 'hangs' will assist in troubleshooting in the same way the old vMotion % 'hangs' did.

Bob

Vel_VMware
Enthusiast

Hi Bob,

Excellent. Were you able to identify how the DRS automation level affects the progress of maintenance mode?

Thanks in advance.

TheBobkin
Champion

Hello Vel_VMware,

Partially-Automated won't automatically vMotion VMs, it just gives recommendations in the same way that Manual mode does:

"In the vSphere Client inventory, right-click a host and select Enter Maintenance Mode.

- If the host is part of a partially automated or manual DRS cluster, a list of migration recommendations for virtual machines running on the host appears."

(Old article but functions the same AFAIK - pubs.vmware.com/vsphere-51/index.jsp?topic=%2Fcom.vmware.vsphere.resmgmt.doc%2FGUID-68DE940C-C2DC-47D3-B660-D3BA5A8B5A75.html )

So manually migrate (following the 'Partially Automated' recommendations), or use Fully Automated DRS (even temporarily).

Bob

senthilkumarms8
Enthusiast

The DRS partially automated setting will not automatically evacuate the VMs. Either manually migrate the VMs or set DRS to fully automated for the time being.

caster
Contributor

Found this thread while searching for a solution.

Just recently I experienced the same conditions: I put a host into maintenance mode and it was stuck at 19%. When running the commands above to check the progress, nothing was happening. I tested manually migrating each VM one by one while in maintenance mode and was allowed to. On this particular host, a VMware replication appliance VM was running, and this appliance was causing maintenance mode to hang. When I went to move this VM I was prompted with a message stating that its settings must be managed within the VM. I migrated the replication appliance manually and maintenance mode was then allowed to complete.

VI__ESX_3_5
VMware Employee

DRS in partially automated mode will not migrate the VMs to other hosts in the vSAN cluster, and because there are still running VMs on the host, the host cannot complete the enter maintenance mode process.

So you have two choices.

1. Manually migrate the VMs to other hosts.

2. Set DRS to fully automated. You can change this setting under vSphere DRS.

For both of these to work properly, please make sure the vMotion network is configured correctly (a quick check is sketched below).

FYI, there are three different modes for putting the host into EMM:

1. Full data migration.

2. Ensure accessibility.

3. No action (no data migration).
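If you prefer to drive this from the host's shell rather than the Web Client, the three modes above map to the --vsanmode values of esxcli - a sketch only, check 'esxcli system maintenanceMode set --help' on your build, and note the host will still wait for running VMs to be vMotioned off, as it will not trigger DRS itself:

# vmkping -I vmk1 <vMotion_IP_of_another_host>      (vmk1 is only an example - test with your own vMotion vmkernel interface to verify the vMotion network)

# esxcli system maintenanceMode set -e true -m ensureObjectAccessibility      (other modes: evacuateAllData, noAction)

# esxcli system maintenanceMode get      (confirms whether the host has entered maintenance mode)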

Alvaidas
Contributor

It seems that you still have virtual machines on the ESXi host; you need to migrate them, and then the host will enter maintenance mode.

Thanks

TheBobkin
Champion

Hello Alvaidas​,

Yes, that was their issue, as I pointed out above.

Just to clarify what it is doing at x% of entering Maintenance Mode for those troubleshooting:

0% - Task not started yet; if it is stuck in this state it can indicate communication issues with vCenter/hostd etc., or issues with vCenter/the vSphere Client.

2% - Precheck

19% - vMotion of VMs off the host. If it is stuck at this point then you need to try manually moving the remaining VMs, which will indicate why they can't (or shouldn't) migrate in their current state (e.g. passthrough devices such as GPUs, a no-longer-available ISO in the CD/DVD device, no VM network available on other hosts, VM disks stored on a local-only datastore, Affinity/Anti-Affinity rules, insufficient compute resources on destination hosts (either due to not enough capacity, reservations or HA failover reservation settings)).

20%-100% - Resync and/or migration of data onto other nodes.

I will aim to get this documented in a KB, as I don't see any particular reason for the above to be internal-only.

Bob
