Hi,
The cluster creation never completes and the cluster never becomes ready.
The vApp and VMs are then removed and recreated in a loop.
Result of a "cat /var/log/cloud-final.err"
Note: I am not sure if we can log in as root with SSH, so the screenshot is from the console.
All pods seem OK.
And in the journal I don't see any relevant error,
but there are a few earlier errors of type "412 Precondition Failed".
The latest error in the events seems to be associated with the deletion of the vApp and load balancer.
Any suggestion on what the next troubleshooting step should be?
I just noticed from Software requirements:
VCD 10.3.3.1 (tested). Will work with VCD 10.3.1 or newer
NSX-T 3.1.1
Avi 20.1.3
Does CSE work with newer versions of NSX and Avi as well? (I.e., is this a minimum version rather than an exact version requirement?)
(In theory, CSE is only supposed to communicate with Cloud Director and let Cloud Director communicate with the others.)
In our environment, which was supported for CSE 3.1.3, we are using:
NSX-T 3.1.3.4
AVI 21.1.1
And the NSX-T version is managed by VMware Cloud Foundation, so a downgrade is not an option.
Attached are the logs generated from https://github.com/vmware/cloud-provider-for-cloud-director/blob/main/scripts/generate-k8s-log-bundl...
I have stopped the CSE service to keep access to the ephemeral VM.
Thanks for the screenshots. They show that the ephemeral VM has all pods up and running.
However, it looks like the target Machine objects are stuck in the Pending state, in a loop.
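To make that concrete: the Machines stuck in Pending can be listed from the ephemeral VM with kubectl. A minimal sketch, with the kubectl output simulated by an inline sample so the filtering step is runnable as-is (the worker name and the `beta004-ns` namespace come from the logs; the second row and the version column are made up for illustration):

```shell
# Simulated output of `kubectl get machines -n beta004-ns` (columns trimmed
# for illustration). On a live cluster, pipe the real kubectl output through
# the same awk filter to list only the Machines stuck in the Pending phase.
cat <<'EOF' | awk 'NR > 1 && $2 == "Pending" {print $1}'
NAME                                     PHASE     VERSION
beta004-worker-pool-1-8554c96b77-2hdm5   Pending   v1.21.8
beta004-worker-pool-1-8554c96b77-xxxxx   Running   v1.21.8
EOF
# → beta004-worker-pool-1-8554c96b77-2hdm5
```

Describing any Machine printed by this filter (`kubectl describe machine <name> -n beta004-ns`) usually shows which condition it is waiting on.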
Thanks for the log files. We will look into them and provide any troubleshooting required.
capi-kubeadm-bootstrap-system/logs.txt shows the following infrastructure issue; we will check further and provide troubleshooting:
"reconciler kind"="KubeadmConfig" "worker count"=10
I0804 13:47:51.777262 1 kubeadmconfig_controller.go:236] controller/kubeadmconfig "msg"="Cluster infrastructure is not ready, waiting"
"kind"="Machine" "name"="beta004-worker-pool-1-8554c96b77-2hdm5" "namespace"="beta004-ns" "reconciler
group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="1634"
I0804 13:47:51.811648 1 kubeadmconfig_controller.go:236] controller/kubeadmconfig "msg"="Cluster infrastructure is not ready, waiting" "kind"="Machine" "name"="beta004-worker-pool-1-8554c96b77-2hdm5" "namespace"="beta004-ns" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="1641"
I0804 13:47:51.817895 1 kubeadmconfig_controller.go:236] controller/kubeadmconfig "msg"="Cluster infrastructure is not ready, waiting" "kind"="Machine" "name"="beta004-worker-pool-1-8554c96b77-2hdm5" "namespace"="beta004-ns" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="1641"
I0804 13:47:51.838246 1 kubeadmconfig_controller.go:236] controller/kubeadmconfig "msg"="Cluster infrastructure is not ready, waiting" "kind"="Machine" "name"="beta004-worker-pool-1-8554c96b77-2hdm5" "namespace"="beta004-ns" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="1642"
==== END logs for container manager of pod capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager-56bdcdf797-skq6h ====
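If it helps with triage, the extent of this wait loop is easy to quantify by grepping the generated log bundle. A sketch that recreates a one-line sample (copied from the excerpt above) under a /tmp path so the command is self-contained; on a real bundle you would point grep at the extracted bundle directory instead:

```shell
# Recreate a one-line sample of the bundle log (the /tmp path is only for
# illustration), then count occurrences of the CAPI wait message exactly as
# you would on a real log bundle directory.
mkdir -p /tmp/bundle/capi-kubeadm-bootstrap-system
cat > /tmp/bundle/capi-kubeadm-bootstrap-system/logs.txt <<'EOF'
I0804 13:47:51.777262 1 kubeadmconfig_controller.go:236] controller/kubeadmconfig "msg"="Cluster infrastructure is not ready, waiting"
EOF
# grep -rc prints <file>:<count> for every file under the directory.
grep -rc "Cluster infrastructure is not ready" /tmp/bundle
# → /tmp/bundle/capi-kubeadm-bootstrap-system/logs.txt:1
```

A count that keeps growing between two bundle captures confirms the controller is still looping on the same wait rather than making progress.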
Thank you for the reply.
I can confirm that the "machine" is in the Pending state.
I have also noticed pending tasks at the load balancer level.
I do not know if they are the origin of this issue or a symptom of it.
There are no members in the pool, so maybe the members are missing because the machine is Pending, or the machine is Pending because the pool has no members.
Update: There was an issue with the vcenter service account used by NSX ALB.
It has been fixed and the cluster creation now reaches new steps.
If you could elaborate on what the issue with the service account was and what resolved it, that might be helpful to other users. Thanks.
Aashima
Hi,
I do not have full details, but from what I understood:
NSX ALB communicates with vCenter using a "vCenter account" dedicated for this purpose. (This is configured in vCenter as part of "Create NSX-T Cloud".)
So it seems that somehow NSX ALB was no longer able to communicate with vCenter; maybe the password had been modified, or something like that.
(Note: I may be mistaken and the issue may have been with the account connecting to NSX Manager, but the concept is the same: a problem with the credentials used for the NSX-T Cloud.)
After fixing the credentials, the deployment was successful.
Summary:
The issue was not related to Tanzu/CSE but to the underlying NSX ALB infrastructure. Unfortunately, it is not easy to pinpoint the origin when looking at the errors at the Tanzu/CSE level.
Hence the feature requests: add a "prerequisite" check and/or a wizard showing the progression of a cluster deployment step by step (showing completed steps, the current step, and the next steps). That way it would be easier to pinpoint the origin of such an issue when one step is stuck.
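As a rough illustration of what such a pre-flight check could look like, here is a hypothetical shell sketch. The endpoint names are placeholders, not real infrastructure, and a real check would obviously need to verify much more (credentials, versions, certificates) than bare TCP reachability:

```shell
# Hypothetical pre-flight sketch: run a list of "step" checks in order and
# report PASS/FAIL for each, so a stuck prerequisite is visible immediately.
# Hostnames below are placeholders, not real infrastructure names.
check() {  # usage: check <step name> <host> <port>
  if timeout 2 bash -c "echo > /dev/tcp/$2/$3" 2>/dev/null; then
    echo "PASS: $1 ($2:$3 reachable)"
  else
    echo "FAIL: $1 ($2:$3 unreachable)"
  fi
}
check "VCD API endpoint" vcd.example.local  443
check "NSX Manager"      nsxm.example.local 443
check "Avi controller"   avi.example.local  443
```

Had something like this run before cluster creation, the broken NSX ALB / vCenter credentials in this thread would have surfaced as one failed step instead of a silent Machine-Pending loop.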