ccalvetbeta
Enthusiast

Deployment stuck in loop

Hi,

The cluster creation never completes and the cluster never becomes ready.
The vApp and VMs are then removed and recreated later, in a loop.

Result of a "cat /var/log/cloud-final.err"
Note: I am not sure whether we can log in as root over SSH, so the screenshot is from the console.

ccalvetbeta_0-1659607662110.png

ccalvetbeta_1-1659610970959.png
All pods seem OK.

ccalvetbeta_2-1659611033242.png
In the journal I don't see any relevant errors.

ccalvetbeta_3-1659611311153.png


But there are a few earlier errors of type "412 Precondition Failed".

ccalvetbeta_4-1659611416100.png

ccalvetbeta_5-1659611492148.png

 

The latest error in the events seems to be associated with the deletion of the vApp and the load balancer.

ccalvetbeta_6-1659611620069.png

 

ccalvetbeta_7-1659611634049.png

 

Any suggestions on what the next troubleshooting step should be?

9 Replies
ccalvetbeta
Enthusiast

I just noticed this in the software requirements:
VCD 10.3.3.1 (tested). Will work with VCD 10.3.1 or newer
NSX-T 3.1.1
Avi 20.1.3

Does CSE next work with newer versions of NSX-T and Avi as well? (In other words, are these minimum versions rather than exact version requirements?)

(Theoretically, CSE is only supposed to communicate with Cloud Director and let Cloud Director communicate with the other components.)
In our environment, which was supported for CSE 3.1.3, we are using:
NSX-T 3.1.3.4
AVI 21.1.1

The NSX-T version is managed by VMware Cloud Foundation, so a downgrade is not an option.
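
For anyone comparing versions the same way, here is a minimal Python sketch that checks whether the deployed versions are at least the versions listed in the requirements. The helper names `parse_version` and `meets_minimum` are made up for illustration. It assumes the listed versions are minimums, which is exactly the open question above, so a passing result only shows that the deployed versions are newer, not that they are supported.

```python
def parse_version(text):
    """Turn a dotted version such as '3.1.3.4' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(deployed, minimum):
    """Return True if deployed >= minimum, padding the shorter version with zeros."""
    d, m = parse_version(deployed), parse_version(minimum)
    width = max(len(d), len(m))
    d += (0,) * (width - len(d))
    m += (0,) * (width - len(m))
    return d >= m


# Versions quoted in this thread.
LISTED = {"NSX-T": "3.1.1", "Avi": "20.1.3"}
DEPLOYED = {"NSX-T": "3.1.3.4", "Avi": "21.1.1"}

for product, minimum in LISTED.items():
    ok = meets_minimum(DEPLOYED[product], minimum)
    print(f"{product}: deployed {DEPLOYED[product]}, listed {minimum}, at least listed version: {ok}")
```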

ccalvetbeta
Enthusiast

Attached are the logs generated from https://github.com/vmware/cloud-provider-for-cloud-director/blob/main/scripts/generate-k8s-log-bundl... 

I have stopped the CSE service to keep access to the ephemeral VM.

sakthi2019
VMware Employee

Thanks for the screenshots. They show that the ephemeral VM has all pods up and running.
However, it looks like the target Machine objects are stuck in the Pending state, in a loop.

sakthi2019
VMware Employee

Thanks for the log files. We will look into them and provide any troubleshooting steps required.

sakthi2019
VMware Employee

capi-kubeadm-bootstrap-system/logs.txt shows the following infrastructure issue. We will investigate further and provide troubleshooting steps.

"reconciler kind"="KubeadmConfig" "worker count"=10
I0804 13:47:51.777262 1 kubeadmconfig_controller.go:236] controller/kubeadmconfig "msg"="Cluster infrastructure is not ready, waiting" "kind"="Machine" "name"="beta004-worker-pool-1-8554c96b77-2hdm5" "namespace"="beta004-ns" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="1634"
I0804 13:47:51.811648 1 kubeadmconfig_controller.go:236] controller/kubeadmconfig "msg"="Cluster infrastructure is not ready, waiting" "kind"="Machine" "name"="beta004-worker-pool-1-8554c96b77-2hdm5" "namespace"="beta004-ns" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="1641"
I0804 13:47:51.817895 1 kubeadmconfig_controller.go:236] controller/kubeadmconfig "msg"="Cluster infrastructure is not ready, waiting" "kind"="Machine" "name"="beta004-worker-pool-1-8554c96b77-2hdm5" "namespace"="beta004-ns" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="1641"
I0804 13:47:51.838246 1 kubeadmconfig_controller.go:236] controller/kubeadmconfig "msg"="Cluster infrastructure is not ready, waiting" "kind"="Machine" "name"="beta004-worker-pool-1-8554c96b77-2hdm5" "namespace"="beta004-ns" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="1642"
==== END logs for container manager of pod capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager-56bdcdf797-skq6h ====
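
With a large log bundle it can be tedious to see which machines keep repeating this message. The following is a rough Python sketch that counts the "infrastructure is not ready" lines per namespace and machine name. The function name `summarize_waiting` and both regular expressions are my own, written against the line layout in the excerpt above; other log formats may need adjustments.

```python
import re
from collections import Counter

# Line header as seen in the excerpt: severity letter + MMDD, time, PID, file:line.
KLOG_RE = re.compile(
    r'^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>[\d:.]+)\s+\d+ '
    r'(?P<src>[\w.]+:\d+)\] (?P<rest>.*)$'
)
# Structured fields look like "key"="value"; keys may contain spaces.
KV_RE = re.compile(r'"([^"]+)"="([^"]*)"')


def summarize_waiting(lines):
    """Count 'infrastructure is not ready' messages per (namespace, name)."""
    counts = Counter()
    for line in lines:
        match = KLOG_RE.match(line.strip())
        if not match:
            continue  # skip wrapped fragments and the '==== END logs' banner
        fields = dict(KV_RE.findall(match.group("rest")))
        if "infrastructure is not ready" in fields.get("msg", ""):
            counts[(fields.get("namespace"), fields.get("name"))] += 1
    return counts


# Two lines copied from the excerpt above, plus the closing banner.
sample = [
    'I0804 13:47:51.811648 1 kubeadmconfig_controller.go:236] controller/kubeadmconfig "msg"="Cluster infrastructure is not ready, waiting" "kind"="Machine" "name"="beta004-worker-pool-1-8554c96b77-2hdm5" "namespace"="beta004-ns" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="1641"',
    'I0804 13:47:51.838246 1 kubeadmconfig_controller.go:236] controller/kubeadmconfig "msg"="Cluster infrastructure is not ready, waiting" "kind"="Machine" "name"="beta004-worker-pool-1-8554c96b77-2hdm5" "namespace"="beta004-ns" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="1642"',
    '==== END logs for container manager of pod capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager-56bdcdf797-skq6h ====',
]

for (namespace, name), count in summarize_waiting(sample).items():
    print(f"{namespace}/{name}: {count} waiting messages")
```

A machine whose count keeps growing across successive bundles is a hint that the infrastructure side is the part that is stuck, rather than the bootstrap controller itself.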

ccalvetbeta
Enthusiast

Thank you for the reply.
I can confirm that the "machine" is in the Pending state.

I have also noticed pending tasks at the load balancer level.
I do not know whether they are the cause of this issue or a symptom of it.

ccalvetbeta_0-1659684467537.png

There are no members in the pool. Either they are missing because the machine is in the Pending state, or the machine is in the Pending state because the pool has no members.

ccalvetbeta_0-1659684660645.png

 



ccalvetbeta
Enthusiast

Update: There was an issue with the vCenter service account used by NSX ALB.
It has been fixed, and the cluster creation now progresses to further steps.

agoel
VMware Employee

If you could elaborate on what the issue with the service account was and what resolved it, that might be helpful to other users. Thanks.

Aashima

ccalvetbeta
Enthusiast

Hi,
I do not have full details, but from what I understood:
NSX ALB communicates with vCenter using a "vCenter account" dedicated to this purpose. (This is part of "create NSX-T Cloud" in vCenter.)
It seems that NSX ALB was somehow no longer able to communicate with vCenter, so maybe the password had been changed or something similar.
(Note: I may be mistaken and the issue may have been with the account connecting to NSX Manager, but the concept is the same: an issue with the credentials used by the NSX-T Cloud.)
After fixing the credentials, the deployment was successful.

Summary:
The issue was not related to Tanzu/CSE but to the underlying NSX ALB infrastructure. Unfortunately, it is not easy to pinpoint the origin when looking at errors at the Tanzu/CSE level.
Hence the feature requests: add a "pre-requisite" check and/or a wizard showing the progress of a cluster deployment step by step (completed steps, current step, and next steps). That would make it easier to pinpoint the origin of such an issue when one step is stuck.
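
To illustrate what I mean by a step-by-step view, here is a purely hypothetical Python sketch. Nothing in it is a CSE, Cloud Director, or NSX ALB API: the `run_checks` function, the step names, and the placeholder lambdas are all invented for the example. The idea is only that ordered checks stop at the first failure and report the completed steps, the stuck step, and the steps not yet reached.

```python
def run_checks(checks):
    """Run (name, check) pairs in order and stop at the first failure.

    Returns (completed_names, failed_name_or_None, remaining_names).
    """
    completed = []
    for index, (name, check) in enumerate(checks):
        try:
            passed = bool(check())
        except Exception:
            passed = False  # a check that raises is treated as failed
        if not passed:
            remaining = [later for later, _ in checks[index + 1:]]
            return completed, name, remaining
        completed.append(name)
    return completed, None, []


# Hypothetical steps; the lambdas stand in for real connectivity or credential tests.
steps = [
    ("Cloud Director API reachable", lambda: True),
    ("NSX ALB can authenticate to vCenter", lambda: False),
    ("Load balancer pool has members", lambda: True),
]

completed, failed, remaining = run_checks(steps)
print("Completed:", completed)
print("Stuck at:", failed)
print("Not reached:", remaining)
```

In our case, a report like this would have pointed at the NSX ALB credentials directly instead of at the pending machine.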
