VMware Beta Community
cchen2
VMware Employee

Process stuck during upgrade

I've successfully deployed a cluster via CSE 4.0 and tried to upgrade it from Kubernetes 1.21.8 to 1.22.9.

After submitting the upgrade request via the GUI, the upgrade didn't kick off. Once I restarted the rdeprojector pod, the upgrade process started.

However, after waiting for over 40 minutes, I found that the control plane nodes had been successfully upgraded to 1.22.9, but the worker nodes were still stuck on 1.21.8. In addition, an extra worker node had been created (2 worker nodes before the upgrade, 3 now) and it was stuck in a processing state (checked via the cluster config API).

I suppose the upgrade process is stuck, and I'd like to know what might have triggered this and how to fix it.

** Some Advice **

1. It would be better if the GUI could show the progress of the entire upgrade process, or at least indicate whether the upgrade is finished or still in progress. In the current version this is confusing and hard to tell.

2. I noticed that the rolling update is done with maxSurge > 0, which means Cluster API will create extra temporary nodes during the update. For resource-sensitive tenants, it may be better to offer an option to configure maxSurge manually (see the sketch below).
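
For illustration, in Cluster API the surge behaviour is controlled by the rollingUpdate strategy on the worker MachineDeployment, so in principle it could be tuned there. A rough sketch only (the names are placeholders, and I'm not sure whether CSE would preserve a manual change like this):

# Sketch: avoid temporary surge nodes by taking down one existing worker at a time
kubectl patch machinedeployment <worker-md-name> -n <cluster-namespace> --type=merge \
  -p '{"spec":{"strategy":{"rollingUpdate":{"maxSurge":0,"maxUnavailable":1}}}}'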

agoel
VMware Employee

Hello cchen2

Were you able to see the cluster upgrade through in its entirety? Cluster upgrades are done in a rolling-upgrade fashion, which is why you noticed new nodes being added; the previous 1.21 nodes should then go away. Were you able to observe the final state of the upgraded cluster?

We have taken the rest of the feedback and will see what we can do about it by the GA timeframe.

Aashima

lzichong
VMware Employee

Hi cchen2,

Thanks for the detailed feedback. The rdeprojector not kicking off is a known issue and has been addressed for GA. The upgrade is most likely stuck, as you said, and the trigger was most likely the ordering of the update: the MachineDeployment was updated first and the MachineTemplate after. We found that this led to inconsistencies on the worker nodes such that their template version was incorrect when we performed kubectl describe on the machines, hence the upgrade got stuck. This issue has been fixed for GA.
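
If you want to confirm the mismatch on your side, a quick check could look roughly like this (a sketch; the machine name and namespace are placeholders, and the column paths follow the upstream Cluster API Machine spec):

# List all machines with their phase and the Kubernetes version each one reports
kubectl get machines -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase,VERSION:.spec.version

# Inspect a stuck worker machine and check the template it references under spec.infrastructureRef
kubectl describe machine <stuck-machine-name> -n <cluster-namespace>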

The workaround is to delete the machines that are stuck in the Provisioning state. After downloading the kubeconfig, you can get a list of machines by running 'kubectl get machines -A'. Delete whichever machines are stuck by running 'kubectl delete machines -n vcdMachineNameSpace vcdMachineName'; this removes the stuck machine and Cluster API should attempt to provision a new one with the correct versioning.
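
Putting it together, the workaround looks roughly like this (a sketch; the kubeconfig file name, namespace and machine name are placeholders for your environment):

# Point kubectl at the kubeconfig downloaded for the workload cluster
export KUBECONFIG=./<cluster-kubeconfig>.yaml

# Identify the worker machines stuck in the Provisioning phase
kubectl get machines -A

# Delete each stuck machine; Cluster API should recreate it with the correct version
kubectl delete machines -n <vcdMachineNameSpace> <vcdMachineName>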

Let us know if this works for you.

Thanks!