VMware Cloud Community
Chrismo16
Contributor

Update Manager Remediation Stuck at 30-33% for Days

I'm running the latest VCSA and am trying to apply patches to my cluster of 2 vSphere 6.5 hosts. I've tried applying them to one server at a time, just one patch at a time, etc. No matter what I try, the remediation sticks at either 30% or 33%.

I pre-stage the updates, and all 7 non-critical patches have staged, and 1 of the 3 critical patches has as well. The other 2 criticals don't seem to be staging.

I could use some help troubleshooting this, including locating the correct logs for VUM on the VCSA.

Thanks in advance!

12 Replies
rcporto
Leadership

Does the host at least enter maintenance mode? If not, try placing the host in maintenance mode manually and then run the remediation task again.
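
If you'd rather script it than click through the Web Client, a rough pyVmomi sketch like the one below should do the same thing; the vCenter address, credentials, and host name are just placeholders, not anything from this thread:

# Rough pyVmomi sketch: place one ESXi host into maintenance mode manually.
# The vCenter address, credentials, and host name are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only; validate certificates in production
si = SmartConnect(host='vcsa.example.local', user='administrator@vsphere.local',
                  pwd='password', sslContext=ctx)
try:
    content = si.RetrieveContent()
    hosts = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True).view
    host = next(h for h in hosts if h.name == 'esxi01.example.local')
    # timeout=0 waits indefinitely while running VMs are evacuated or powered off
    WaitForTask(host.EnterMaintenanceMode_Task(timeout=0))
    print(host.name, 'is now in maintenance mode')
finally:
    Disconnect(si)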

---

Richardson Porto
Senior Infrastructure Specialist
LinkedIn: http://linkedin.com/in/richardsonporto
Chrismo16
Contributor

Firstly, thank you. This absolutely worked.

I would like to find out why my hosts were not attempting to enter maintenance mode on their own when remediating the cluster. I'm still learning the platform and want to learn more about automating tasks like this, so being able to troubleshoot it will go a long way toward that.

Any suggestions, logs to look into, etc?

schwupp
Contributor

Same issue here: a cluster with 2x ESXi 6.5 hosts. Maintenance mode for a single host works perfectly with fully automated DRS, but remediation of the cluster (with the options checked to disable DPM, FT, and HA) breaks at 30%.

I looked into /var/log/vmware/vmware-updatemgr/vum-server/vmware-vum-server-log4cpp.log and saw:

[2017-08-18 12:25:13:842 'VciTaskBase.VciClusterJobDispatcherTask{204}' 140259801110272 ERROR]  [vciClusterJobSchedulerTask, 1035] DRS API returned faults

[2017-08-18 12:25:13:842 'VciTaskBase.VciClusterJobDispatcherTask{204}' 140259801110272 ERROR]  [vciClusterJobSchedulerTask, 1040] No current remediation is going on so address the faults

[2017-08-18 12:25:13:842 'VciTaskBase.VciClusterJobDispatcherTask{204}' 140259801110272 ERROR]  [vciClusterJobSchedulerTask, 1044] No of returned faults : 27

[2017-08-18 12:25:13:844 'VciTaskBase.VciClusterJobDispatcherTask{204}' 140259801110272 ERROR]  [vciClusterJobSchedulerTask, 1053] Name of VM that caused fault : <Name of all running VMs>

[2017-08-18 12:30:14:109 'VciTaskBase.VciClusterJobDispatcherTask{204}' 140259801110272 INFO]  [vciClusterJobSchedulerTask, 1071] DRS API indicates no active host available in cluster. Discarding fault.

[2017-08-18 12:30:14:181 'VciTaskBase.VciClusterJobDispatcherTask{204}' 140259801110272 ERROR]  [vciClusterJobSchedulerTask, 1666] Skipping host from remediation as DRS API did not recommend host host-270 to go into mmode

I don't know what could be wrong here.
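
In case anyone wants to pull just the DRS-related errors out of that log, a quick filter like this works (same log path as above; run it on the appliance or on a copied log):

# Quick filter for DRS-related errors in the VUM log on the VCSA.
LOG = '/var/log/vmware/vmware-updatemgr/vum-server/vmware-vum-server-log4cpp.log'

with open(LOG) as f:
    for line in f:
        if 'DRS' in line and ('ERROR' in line or 'Skipping host' in line):
            print(line.rstrip())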

jimm_chen
Contributor

Having the exact same issue on VCSA 6.5 U1 and ESXi 6.5. Can't seem to find any explanation, KB, or solution.

SKYWAYCH
Contributor

Same issue here

pcookhayboo
Contributor

Having the same issue here.

Marc_B
Contributor

Same issue here (stuck at 27%). After that, I tried Critical Host Patches alone and then Non-Critical Host Patches alone, and it worked. I hope this helps.

karlg100
Contributor

I'm having the same issue. Look at vpxd.log and grep for drmLogger; if you see faults, that could give you a clue as to why DRS says it can't put a host into maintenance mode.

Some of my VMs I know have issues (stale backup flags), but even after clearing those out, there are sometimes "host affinity" faults. Even when I clear those as well, hosts are still not entering maintenance mode, and there are no errors in vpxd.log explaining why DRS is unhappy with any of the hosts.

So my issue is that I can clear all the faults, and the DRS API still tells VUM that there are no hosts that can enter maintenance mode.
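
To pull the drmLogger lines out of vpxd.log in one pass, a small filter like this works; the path below is the usual VCSA 6.x location and is an assumption, so adjust it if yours differs:

# Filter vpxd.log for drmLogger entries.
# The path is assumed to be the usual VCSA 6.x location; adjust as needed.
LOG = '/var/log/vmware/vmware-vpx/vpxd.log'

with open(LOG) as f:
    for line in f:
        if 'drmLogger' in line:
            print(line.rstrip())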

Here's a workaround that works well, especially if you have vSAN enabled:

-schedule a patch job for all the hosts you want to remediate

-if you're lucky, some of your hosts might patch

-allow the job to get stuck (watch /var/log/vmware/vmware-updatemgr/vum-server/vmware-vum-server-log4cpp.log for errors like the ones below)

[2018-01-17 18:45:05:274 'VciTaskBase.VciClusterJobDispatcherTask{1111}' 140371731937024 INFO]  [vciClusterJobSchedulerTask, 1160] No of hosts recommended by DRS API : 0

[2018-01-17 18:45:05:274 'VciTaskBase.VciClusterJobDispatcherTask{1111}' 140371731937024 INFO]  [vciClusterJobSchedulerTask, 440] DRS API did not return any hosts

-go to a host that hasn't been patched (try the one with the fewest VMs first), go to its VMs tab, and manually vMotion all the VMs off to a patched host or another host (see the sketch after this list)

-after this, you should see in vmware-vum-server-log4cpp.log that DRS reports the host is ready to be patched, which triggers VUM to put the host into maintenance mode, patch it, reboot it, and take it back out of maintenance mode

-when the host is back up, repeat for all remaining hosts

-if you don't have vSAN, you could do this to multiple hosts at a time, up to the maximum number of concurrent hosts you configured in your VUM job
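
Here's a rough pyVmomi sketch of the manual vMotion step above; the host names and credentials are placeholders, and doing it from the Web Client works just as well:

# Rough pyVmomi sketch of the manual vMotion step: evacuate all powered-on VMs
# from an unpatched host to a patched one. Names and credentials are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only
si = SmartConnect(host='vcsa.example.local', user='administrator@vsphere.local',
                  pwd='password', sslContext=ctx)
try:
    content = si.RetrieveContent()
    hosts = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True).view
    source = next(h for h in hosts if h.name == 'esxi01.example.local')  # unpatched host
    target = next(h for h in hosts if h.name == 'esxi02.example.local')  # patched host
    for vm in list(source.vm):
        if vm.runtime.powerState == vim.VirtualMachine.PowerState.poweredOn:
            print('Migrating', vm.name, '->', target.name)
            WaitForTask(vm.MigrateVM_Task(
                pool=None, host=target,
                priority=vim.VirtualMachine.MovePriority.defaultPriority))
finally:
    Disconnect(si)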

I have opened a ticket and they are looking into my issue.

Open a ticket with support as well so we get some priority behind fixing this.

proBES
Contributor

I have the same issue, and I'm not using vSAN.

VMware was investigating this under SR17613017710, and after lots of testing with different HA settings we finally got this answer:

Remediating a two Node Cluster is not supported with vSphere 6.5 for availability reasons.

This behavior will be reflected as "Works as designed" in the VMware documentation (nothing found so far...).

There is no way to change this behavior. I opened a feature request, as I believe one should be able to override this.

If you think so too, please also submit a feature request: Feature Request

anthonyvallejo
Contributor

I just submitted an SR before stumbling across this thread for the same reasons. This fails for me on multiple clusters with more than 2 nodes....

And why would a 2-node VUM upgrade not be supported? Isn't that why the options exist to allow parallel remediation and to make sure HA is disabled?

karlg100
Contributor

I'm having issues with 6-node and 3-node clusters using DRS. If DRS is not employed, then everything works great.

goslackware3
Contributor

The vSphere 6.7 documentation for remediating hosts with VUM now includes the following note:

---

Note:

When you perform remediation on a cluster that consists of no more than two hosts, disabling HA admission control might not be enough to ensure successful remediation. You might need to disable vSphere Availability (HA) on the cluster. If you keep HA enabled, remediation attempts on hosts in the cluster fail, because HA cannot provide a recommendation to Update Manager to place any of the hosts into maintenance mode. The reason is that if one of the two hosts is placed into maintenance mode, there is no failover host left available in the cluster. To ensure successful remediation of a two-node cluster, disable HA on the cluster, or place the hosts into maintenance mode manually and then remediate the two hosts in the cluster.

---

I agree that VUM should be able to disable HA during the upgrade process...
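
Until it can, a minimal pyVmomi sketch like the one below will toggle HA off and back on around a remediation run; the cluster name and vCenter details are placeholders:

# Minimal pyVmomi sketch: disable vSphere HA on a cluster before remediating a
# two-node cluster, then re-enable it afterwards. Names below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only
si = SmartConnect(host='vcsa.example.local', user='administrator@vsphere.local',
                  pwd='password', sslContext=ctx)
try:
    content = si.RetrieveContent()
    clusters = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True).view
    cluster = next(c for c in clusters if c.name == 'Cluster01')

    def set_ha(enabled):
        # modify=True means only the fields we set here are changed
        spec = vim.cluster.ConfigSpecEx(
            dasConfig=vim.cluster.DasConfigInfo(enabled=enabled))
        WaitForTask(cluster.ReconfigureComputeResource_Task(spec, modify=True))

    set_ha(False)   # turn HA off, then kick off the VUM remediation
    # ... run the remediation here ...
    set_ha(True)    # turn HA back on afterwards
finally:
    Disconnect(si)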