VMware Cloud Community
rebelfalls
VMware Employee

Host fails to go into Maintenance Mode

After a VxRail update I am unable to put a node into maintenance mode. I am getting the error "not allowed in current state".

I tried to put the node into MM from the CLI with "esxcli system maintenanceMode set -m ensureObjectAccessibility -e true", which at first looked like it was working, but the task never ended (I waited until the next day).
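
For reference, the host's own view of MM can be double-checked from the ESXi shell roughly like this (a minimal sketch, assuming a standard ESXi 6.7 CLI; field names in the output may vary by build):

# What the host itself reports for maintenance mode
esxcli system maintenanceMode get
vim-cmd hostsvc/hostsummary | grep -i inMaintenanceMode
# vSAN's view of the node (check the node / maintenance mode state in the output)
esxcli vsan cluster get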

- All VMs were already migrated off the node successfully, so it is unlikely that a VM that could not be migrated was blocking the node from entering MM.
- vsanmgmt.log was checked but nothing relevant was found (see the grep sketch below)
- vsansystem.log was checked but nothing relevant was found
- services.sh was used to restart the services on the node => same behavior afterwards
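
For anyone repeating the log check, something along these lines works from the ESXi shell (a sketch only; log paths as on 6.7):

# Look for maintenance mode / decommission related entries around the time of the attempt
grep -iE "maintenance|decom" /var/log/vsanmgmt.log
grep -iE "maintenance|decom" /var/log/vsansystem.log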

Hostd shows the host beginning to enter MM, but the task never completes or fails:

2020-11-02T12:39:33.933Z info hostd[2103296] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 79104 : The host has begun entering maintenance mode.
2020-11-02T12:39:33.934Z info hostd[2100959] [Originator@6876 sub=Vimsvc.TaskManager opID=f04735b4 user=vpxuser] Task Created : haTask--vim.event.EventHistoryCollector.readNext-372223905
2020-11-02T12:39:33.935Z info hostd[2103296] [Originator@6876 sub=Vimsvc.TaskManager opID=f04735b4 user=vpxuser] Task Completed : haTask--vim.event.EventHistoryCollector.readNext-372223905 Status success
2020-11-02T12:39:33.971Z info hostd[2101423] [Originator@6876 sub=Vimsvc.ha-eventmgr opID=vim-cmd-f3-35a9 user=root] Event 79105 : Host xxx.xxx in ha-datacenter has started to enter maintenance mode
2020-11-02T12:39:33.971Z info hostd[2101423] [Originator@6876 sub=Hostsvc opID=vim-cmd-f3-35a9 user=root] Message bus proxy is stopped already.
2020-11-02T12:39:33.971Z info hostd[2100957] [Originator@6876 sub=Vimsvc.TaskManager opID=04ba010d-35b2 user=dcui:vsanmgmtd] Task Created : vmodlTask-ha-host-372223906
2020-11-02T12:39:33.971Z info hostd[2100957] [Originator@6876 sub=Vimsvc.TaskManager opID=04ba010d-35b2 user=dcui:vsanmgmtd] Task Completed : haTask--vim.TaskManager.createTask-372223904 Status success
2020-11-02T12:39:33.973Z info hostd[2103354] [Originator@6876 sub=Vimsvc.TaskManager opID=f04735b5 user=vpxuser] Task Created : haTask--vim.event.EventHistoryCollector.readNext-372223907
2020-11-02T12:39:33.973Z info hostd[2103296] [Originator@6876 sub=Vimsvc.TaskManager opID=f04735b5 user=vpxuser] Task Completed : haTask--vim.event.EventHistoryCollector.readNext-372223907 Status success
2020-11-02T12:39:33.976Z info hostd[2100957] [Originator@6876 sub=Vimsvc.TaskManager opID=f04735ba user=dcui:vsanmgmtd] Task Created : haTask-ha-host-vim.Task.UpdateDescription-372223908

 

The vobd.log shows that the host entered MM at 12:39 on 02/11 and exited at 07:14 on 03/11. Between those times the log was spammed with "Firewall configuration has changed. Operation 'enable' for rule set esxupdate succeeded. Firewall configuration has changed. Operation 'disable' for rule set esxupdate succeeded."

Just before the host exited MM there was an alert that vmnic2 and vmnic3 were down (a quick way to re-check the ruleset and link states is sketched after the log excerpt).

2020-11-02T12:39:14.987Z: [UserLevelCorrelator] 2775057927891us: [esx.audit.ssh.session.opened] SSH session was opened for 'root@xxx.xxx.xxx'.
2020-11-02T12:39:33.933Z: [GenericCorrelator] 2775076873416us: [vob.user.maintenancemode.entering] The host has begun entering maintenance mode
2020-11-02T12:39:33.933Z: [UserLevelCorrelator] 2775076873416us: [vob.user.maintenancemode.entering] The host has begun entering maintenance mode
2020-11-02T12:39:33.933Z: [UserLevelCorrelator] 2775076873812us: [esx.audit.maintenancemode.entering] The host has begun entering maintenance mode.
2020-11-02T12:44:29.351Z: [GenericCorrelator] 2775372291176us: [vob.user.ssh.session.opened] SSH session was opened for 'root@xxx.xxx.xxx

***

2020-11-03T07:11:27.937Z: [netCorrelator] 23239165us: [vob.net.vmnic.linkstate.up] vmnic vmnic0 linkstate up
2020-11-03T07:11:27.944Z: [netCorrelator] 23246255us: [vob.net.vmnic.linkstate.up] vmnic vmnic1 linkstate up
2020-11-03T07:11:27.946Z: [netCorrelator] 23248365us: [vob.net.vmnic.linkstate.down] vmnic vmnic2 linkstate down
2020-11-03T07:11:27.948Z: [netCorrelator] 23250489us: [vob.net.vmnic.linkstate.down] vmnic vmnic3 linkstate down
2020-11-03T07:11:28.002Z: [netCorrelator] 23303941us: [esx.clear.net.vmnic.linkstate.up] Physical NIC vmnic0 linkstate is up
2020-11-03T07:11:28.002Z: An event (esx.clear.net.vmnic.linkstate.up) could not be sent immediately to hostd; queueing for retry.
2020-11-03T07:11:28.002Z: [netCorrelator] 23304011us: [esx.clear.net.vmnic.linkstate.up] Physical NIC vmnic1 linkstate is up
2020-11-03T07:11:28.002Z: An event (esx.clear.net.vmnic.linkstate.up) could not be sent immediately to hostd; queueing for retry.
2020-11-03T07:11:28.002Z: [netCorrelator] 23304035us: [esx.problem.net.vmnic.linkstate.down] Physical NIC vmnic2 linkstate is down
2020-11-03T07:11:28.002Z: An event (esx.problem.net.vmnic.linkstate.down) could not be sent immediately to hostd; queueing for retry.
2020-11-03T07:11:28.002Z: [netCorrelator] 23304058us: [esx.problem.net.vmnic.linkstate.down] Physical NIC vmnic3 linkstate is down
2020-11-03T07:11:28.002Z: An event (esx.problem.net.vmnic.linkstate.down) could not be sent immediately to hostd; queueing for retry.
2020-11-03T07:11:29.664Z: [netCorrelator] 24965794us: [vob.net.vmnic.linkstate.up] vmnic vusb0 linkstate up

***

2020-11-03T07:12:59.447Z: [UserLevelCorrelator] 114269950us: [vob.user.host.boot] Host has booted.
2020-11-03T07:12:59.447Z: [GenericCorrelator] 114269950us: [vob.user.host.boot] Host has booted.
2020-11-03T07:12:59.447Z: [UserLevelCorrelator] 114270270us: [esx.audit.host.boot] Host has booted.

***

2020-11-03T07:14:28.003Z: Successfully sent event (esx.audit.net.firewall.config.changed) after 1 failure.
2020-11-03T07:14:28.003Z: Successfully sent event (esx.audit.dcui.enabled) after 1 failure.
2020-11-03T07:14:28.003Z: Successfully sent event (esx.audit.shell.enabled) after 1 failure.
2020-11-03T07:14:28.003Z: Successfully sent event (esx.problem.clock.correction.adjtime.sync) after 1 failure.
2020-11-03T07:14:42.043Z: [GenericCorrelator] 216865712us: [vob.user.maintenancemode.exited] The host has exited maintenance mode
2020-11-03T07:14:42.043Z: [UserLevelCorrelator] 216865712us: [vob.user.maintenancemode.exited] The host has exited maintenance mode
2020-11-03T07:14:42.043Z: [UserLevelCorrelator] 216866113us: [esx.audit.maintenancemode.exited] The host has exited maintenance mode.
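
For completeness, the esxupdate ruleset state and the current NIC link states can be confirmed from the shell like this (a rough sketch, not output taken from the logs above):

# Current state of the esxupdate firewall ruleset that keeps toggling in vobd.log
esxcli network firewall ruleset list | grep -i esxupdate
# Current physical NIC link states (vmnic2/vmnic3 were reported down)
esxcli network nic list
# How often the esxupdate ruleset was toggled
grep -c "rule set esxupdate" /var/log/vobd.log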

 

I thought that simply disabling HA before remediation should make the process work, as per this doc: https://docs.vmware.com/en/VMware-vSphere/6.7/com.vmware.vsphere.update_manager.doc/GUID-90AA4FDB-B2...

I did disable HA on the cluster, but the behavior is still the same. I am not able to place a node into MM using vCenter (it results in "Operation not allowed in current state"), and the CLI behaves the same way: the node starts to enter MM with ensure accessibility but the job never ends.
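
In case it helps, this is the kind of host-side check and cleanup I have in mind for the CLI attempt that never finishes (a sketch only; that the decommission state record looks exactly like this is an assumption on my part):

# Cancel a vSAN maintenance mode entry that is stuck in progress
esxcli vsan maintenancemode cancel
# Inspect the node's vSAN decommissioning state (decomState 0 should mean "not decommissioning")
cmmds-tool find -t NODE_DECOM_STATE -f json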

There are some errors mentioning 'firewall'; could this be the issue?


Any help is appreciated, thank you 

 

scott28tt
VMware Employee

@rebelfalls 

Moderator: Moved to vSphere Discussions - not specific to vCenter.


nachogonzalez
Commander

Hey, hope you are doing fine.

Some quick questions:

Do you have any pending tasks?
Is the host connected to vCenter?
Do you have NSX DFW in place?

rebelfalls
VMware Employee

Hey Nacho, 

Thanks for your reply. To answer your questions:

Do you have any pending tasks? - No stuck or pending tasks in the task list in vCenter (the host-side task list can also be checked directly; see the sketch below)
Is the host connected to vCenter? - Yes
Do you have NSX DFW in place? - No NSX in use at all
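
For the host-side check, something like this from the ESXi shell (sketch):

# Any tasks hostd itself still considers active on the node
vim-cmd vimsvc/task_list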

There were also no large VMs being migrated at the time the host was going into MM. VMs were manually migrated off the node before I tried to place it into MM.

I tried with 2 more hosts and received the same error: "not allowed in current state".

ZibiM
Enthusiast

I was going to suggest a services.sh restart, but I see from the description that you already did that.

What is the DRS status on the cluster?

What is the vSAN status on the cluster?

Do you have enough capacity to try MM with full data evacuation?

Is there a chance you have some vSAN objects with FTT=0 located on this node? (One way to check from the host is sketched below.)
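
A rough way to check that last point from the ESXi shell (a sketch; the exact policy field names in the output are an assumption and the command can take a while on large clusters):

# Dump per-object info and show some context around any FTT=0 policies so the object UUIDs are visible
esxcli vsan debug object list | grep -i -B 12 "hostFailuresToTolerate: 0"
# Overall object health summary for the cluster
esxcli vsan debug object health summary get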

nachogonzalez
Commander

Do you have HA enabled?
How is Admission control configured? 
What happens if you turn off HA and try putting the host in MM?

rebelfalls
VMware Employee

- What is the DRS status on the cluster? Fully automated

- What is the vSAN status on the cluster? Healthy, all green

- Do you have enough capacity to try MM with full data evacuation? More than enough space, but since vCenter doesn't allow placing any of the 4 nodes into MM we cannot do a full data evacuation. When done via the CLI the task never ends.

- Is there a chance you have some vSAN objects with FTT=0 located on this node? No FTT=0, but one policy exists which stripes across 8 disks

ZibiM
Enthusiast

Do you have more than 2 capacity disks in your nodes?

If not, then this might be your issue: with a stripe width of 8, each replica's components have to be spread across 8 capacity disks, so a small cluster can quickly run out of valid placements when a node has to evacuate.

Please try to change the policy to a stripe width of 6.
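
A quick way to count the capacity-tier disks per node from the ESXi shell (sketch; the "Is Capacity Tier" field is what I'd expect on 6.x but may be named differently on your build):

# Number of disks this host has claimed for the vSAN capacity tier
esxcli vsan storage list | grep -c "Is Capacity Tier: true"
# Alternative view of the vSAN disk inventory on the host
vdq -q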
