I successfully deployed a cluster via CSE 4.0 and tried to upgrade it from 1.21.8 to 1.22.9. After submitting the upgrade request via the GUI, the upgrade didn't kick off. I then restarted the rdeprojector pod and the upgrade process started. However, after waiting for over 40 minutes, I found that while the control plane nodes were successfully upgraded to 1.22.9, the worker nodes were stuck on 1.21.8, and an additional worker node had been added (2 worker nodes before the upgrade, 3 now) that is stuck in a processing status (checked via the cluster config API). I suspect the upgrade process is stuck, and I'd like to know what might trigger this and how to fix it.
** Some Advice **
1. It would be better if the GUI could show the progress of the entire upgrade process, or at least show whether the upgrade is finished or still in progress. In the current version it is confusing and hard to tell.
2. I noticed that the rolling update is done with maxSurge > 0, which means Cluster API will create additional temporary nodes during the update. For resource-sensitive tenants, it may be better to offer an option to configure maxSurge manually (a sketch of a manual override is below).
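For context, in Cluster API the surge behavior lives on the MachineDeployment rollout strategy. A minimal sketch of what a manual override could look like, assuming the worker pool is backed by a MachineDeployment (the deployment name, namespace, and kubeconfig path are placeholders, and CSE may reconcile manual changes back through the RDE):

# Find the MachineDeployment backing the worker pool
kubectl get machinedeployments -A --kubeconfig=/path/of/kubernetes-config.txt

# Replace nodes in place (maxSurge=0) instead of creating a temporary extra node
kubectl patch machinedeployment my-cluster-worker-pool-1 -n my-cluster-ns \
  --type merge \
  -p '{"spec":{"strategy":{"rollingUpdate":{"maxSurge":0,"maxUnavailable":1}}}}' \
  --kubeconfig=/path/of/kubernetes-config.txt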
The cluster events show ScriptExecutionError and the cluster stays in "Not Ready" status forever. I found some errors in the logs:

E0808 03:26:47.018187 1 controller.go:188] controller/kubeadmcontrolplane "msg"="Failed to update KubeadmControlPlane Status" "error"="failed to create remote cluster client: failed to retrieve kubeconfig secret for Cluster capvcd-cluster3-ns/capvcd-cluster3: secrets \"capvcd-cluster3-kubeconfig\" not found" "cluster"="capvcd-cluster3" "name"="capvcd-cluster3-control-plane" "namespace"="capvcd-cluster3-ns" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"

The log bundle is attached. Could someone help take a look?
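The error says the capvcd-cluster3-kubeconfig secret was never created in the management cluster, which usually means bootstrap never got far enough to generate it. A quick way to confirm from the ephemeral/management cluster (the kubeconfig path is a placeholder):

# Check whether the workload cluster's kubeconfig secret exists
kubectl get secret capvcd-cluster3-kubeconfig -n capvcd-cluster3-ns --kubeconfig=/path/to/management-kubeconfig

# If it is missing, the Cluster and Machine objects usually show the blocking condition
kubectl get cluster,machines -n capvcd-cluster3-ns --kubeconfig=/path/to/management-kubeconfig
kubectl describe cluster capvcd-cluster3 -n capvcd-cluster3-ns --kubeconfig=/path/to/management-kubeconfig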
Following up on what sakthi2019 said: if you notice the replicas value has changed after checking the RDE in entity->spec->capiYaml (you can Ctrl/Cmd+F and search for "replicas"), then it is possible that the pod responsible for applying updates (rdeprojector) has stopped reconciliation. This is a known issue with the RDEProjector, but it has been fixed for GA. To resolve it, you may need to delete the rdeprojector pod and let it restart. To do this, you will need to access the cluster with kubectl:
1. Download the Kubernetes config associated with your cluster from the UI.
2. After downloading the Kubernetes config, note its path, as it needs to be specified when running kubectl commands.
3. Get the list of running pods with 'kubectl get pods -A --kubeconfig=/path/of/kubernetes-config.txt'.
4. Look for a pod whose name starts with 'rdeprojector-' and delete it with 'kubectl delete pod -n rdeprojector-system rdeprojectorPodName --kubeconfig=/path/of/kubernetes-config.txt'. This force-restarts rdeprojector, since a new pod is automatically brought up afterwards, and after some time the pending updates should apply.
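Putting steps 3 and 4 together as one sequence (the pod name is an example; use the actual name from the get pods output):

kubectl get pods -n rdeprojector-system --kubeconfig=/path/of/kubernetes-config.txt
kubectl delete pod -n rdeprojector-system rdeprojector-controller-manager-xxxxx --kubeconfig=/path/of/kubernetes-config.txt
# A replacement pod is brought up automatically; watch for it before expecting updates to resume
kubectl get pods -n rdeprojector-system -w --kubeconfig=/path/of/kubernetes-config.txt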
Good to know that you have created a cluster. The resize issue is a known issue for the beta. CSE polls the RDE at a regular interval to pick up any change. You can check https://{{base_url}}/cloudapi/1.0.0/entities/types/vmware/capvcdCluster/1.1.0 to see whether the RDE got updated: entity->spec->capiYaml should have the updated replica count for the worker node pool.
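For example, from the command line (the bearer token and the jq filter are illustrative; this assumes the standard cloudapi paged response with the payload under "entity"):

# List capvcdCluster RDEs and grep each cluster's capiYaml for the replica count
curl -sk -H "Authorization: Bearer $VCD_TOKEN" \
  -H "Accept: application/json;version=36.0" \
  "https://{{base_url}}/cloudapi/1.0.0/entities/types/vmware/capvcdCluster/1.1.0" \
  | jq -r '.values[].entity.spec.capiYaml' | grep -n "replicas"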
I managed to create a new cluster, and it is now in Ready state. It was initially provisioned with 3 control plane nodes and 1 worker node. I am trying to increase to two worker nodes. In the resize wizard I select 2 for "Number of Nodes" and click Submit. I end up with the message "Acknowledged node pool resize request", but after that, nothing: no new events or tasks. The CSE journal doesn't seem to contain anything relevant to this request, only a "status check" of the cluster every minute. Is this a known issue, or is it supposed to work?
Now it is working. The task in vCenter is shown below, and the journalctl CSE logs are attached.
Update: There was an issue with the vCenter service account used by NSX ALB. It has been fixed, and the cluster creation now progresses further.
Thank you for the reply. I can confirm that the "machine" is in Pending state. I have also noticed pending tasks at the load balancer level. I do not know whether they are the origin of this issue or a symptom of it. There are no members in the pool, so maybe they are missing because the machine is in Pending state, or the machine is in Pending state because the pool doesn't have members.
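For anyone hitting the same loop, the machine phase can be checked directly from the ephemeral/management cluster to see which condition is blocking it (the names and kubeconfig path are placeholders):

# Show each machine's phase (Pending/Provisioning/Running)
kubectl get machines -A --kubeconfig=/path/to/management-kubeconfig

# Describe a pending machine; the conditions section names the blocking step
kubectl describe machine <pending-machine-name> -n <cluster-namespace> --kubeconfig=/path/to/management-kubeconfig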
capi-kubeadm-bootstrap-system/logs.txt has the following infra issue; we will check more and provide any troubleshooting needed.

"reconciler kind"="KubeadmConfig" "worker count"=10
I0804 13:47:51.777262 1 kubeadmconfig_controller.go:236] controller/kubeadmconfig "msg"="Cluster infrastructure is not ready, waiting" "kind"="Machine" "name"="beta004-worker-pool-1-8554c96b77-2hdm5" "namespace"="beta004-ns" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="1634"
I0804 13:47:51.811648 1 kubeadmconfig_controller.go:236] controller/kubeadmconfig "msg"="Cluster infrastructure is not ready, waiting" "kind"="Machine" "name"="beta004-worker-pool-1-8554c96b77-2hdm5" "namespace"="beta004-ns" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="1641"
I0804 13:47:51.817895 1 kubeadmconfig_controller.go:236] controller/kubeadmconfig "msg"="Cluster infrastructure is not ready, waiting" "kind"="Machine" "name"="beta004-worker-pool-1-8554c96b77-2hdm5" "namespace"="beta004-ns" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="1641"
I0804 13:47:51.838246 1 kubeadmconfig_controller.go:236] controller/kubeadmconfig "msg"="Cluster infrastructure is not ready, waiting" "kind"="Machine" "name"="beta004-worker-pool-1-8554c96b77-2hdm5" "namespace"="beta004-ns" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="1642"
==== END logs for container manager of pod capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager-56bdcdf797-skq6h ====
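"Cluster infrastructure is not ready" means the Cluster's infrastructureRef (the VCDCluster object in CAPVCD) has not reported ready yet. A quick way to see where it is stuck, assuming the cluster lives in the beta004-ns namespace (the kubeconfig path is a placeholder):

# Check whether the infrastructure object has become ready
kubectl get cluster,vcdclusters -n beta004-ns --kubeconfig=/path/to/management-kubeconfig

# The conditions usually name the blocking step (e.g. load balancer creation)
kubectl describe vcdcluster -n beta004-ns --kubeconfig=/path/to/management-kubeconfig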
Thanks for the log files. We will look into them and provide any troubleshooting required.
Thanks for the screenshots. They show the ephemeral VM has all pods up and running. But it looks like the target machine objects are stuck in a Pending-state loop.
Thanks for the feedback. We will share it with the team working on this.
See https://kb.vmware.com/s/article/1002123  
Attached are the logs generated with https://github.com/vmware/cloud-provider-for-cloud-director/blob/main/scripts/generate-k8s-log-bundle.sh. I have stopped the CSE service to keep access to the ephemeral VM.
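For others who need the same bundle, the script can be fetched from the repo's raw URL (derived from the path above); check the script header for the exact arguments it expects before running it:

# Download the log bundle script
curl -LO https://raw.githubusercontent.com/vmware/cloud-provider-for-cloud-director/main/scripts/generate-k8s-log-bundle.sh
chmod +x generate-k8s-log-bundle.sh
# Run it against the cluster (usage per the script header)
KUBECONFIG=/path/to/kubernetes-config.txt ./generate-k8s-log-bundle.sh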
I just noticed from the software requirements:
VCD 10.3.3.1 (tested). Will work with VCD 10.3.1 or newer
NSX-T 3.1.1
Avi 20.1.3
Does CSE next work with newer versions of NSX and Avi as well? (That is, is this a minimum version rather than an exact version requirement?) Theoretically, CSE is only supposed to communicate with Cloud Director and let Cloud Director communicate with the others. In our environment, which was supported for CSE 3.1.3, we are using:
NSX-T 3.1.3.4
Avi 21.1.1
And the NSX-T version is managed by VMware Cloud Foundation, so a downgrade is not an option.
Hi,
The cluster creation never completes and the cluster is never ready; the vApp and VM are removed and recreated later in a loop. Below is the result of "cat /var/log/cloud-final.err". Note: I am not sure whether we can log in as root over SSH, so the screenshot is from the console. All pods seem OK, and in the journal I don't see a relevant error, but there were a few earlier errors of type "412 Precondition Failed". It seems the latest error in the events is associated with the event of deleting the vApp and load balancer. Any suggestion on what the next troubleshooting step should be?
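If console access is available, cloud-init's own status output can narrow down which bootstrap stage failed; these are standard cloud-init/systemd commands, nothing CSE-specific:

# Summarize whether cloud-init finished, errored, or is still running
cloud-init status --long
# Full output of the final stage, where the bootstrap script runs
cat /var/log/cloud-final.err
journalctl -u cloud-final --no-pager | tail -50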
Hi,
Would it be possible to add a pre-requisite health check in the GUI (similar to the Distributed Switch health check)? It would avoid trying to fix a deployment that cannot work. There could be multiple levels of health check:
Infrastructure: confirm that all objects have been properly created in the API. (The logged-in user may not have the rights to see such settings, so this test should run under a different account.)
User: confirm the logged-in user has all prerequisite permissions.
Then the user selects in which organization network to simulate a deployment:
- Confirm the network has an IP pool configured with enough free IP addresses
- Confirm DNS is configured
- Confirm the edge is properly configured with access to a load balancer
- Confirm enough external IP addresses are available
- Confirm enough VIPs are available on the edge
- Confirm enough capacity (CPU/memory/storage)
- Confirm sizing policies are created
Then deploy a test VM in this network (similar to how the ephemeral VM would be created):
- Confirm the test VM has access to the DNS server
- Confirm the VM has access to all URLs needed (the list should be provided in the documentation; not all environments can provide full internet access)
- Confirm the VM has access to Cloud Director (and eventually whether certificates are trusted)
The list is non-exhaustive. If all pass/fail test results were visible in the GUI, it would be easy to pinpoint wrong settings and fix them before even trying to deploy a cluster. A rough sketch of the test-VM checks follows below.
Regards,
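A minimal sketch of the connectivity checks from inside the test VM, using only standard tools (the hostnames, IPs, and URLs are placeholders for your VCD endpoint, DNS server, and required endpoints):

# Can we resolve names via the configured DNS server?
nslookup vcd.example.com 192.0.2.53

# Can we reach Cloud Director, and is its certificate trusted?
curl -v https://vcd.example.com/api/versions -o /dev/null

# Can we reach the external URLs the bootstrap needs (list per documentation)?
for url in https://projects.registry.vmware.com https://github.com; do
  curl -s -o /dev/null -w "%{http_code} $url\n" "$url"
done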
Hello, okay, so you're seeing the known issue where some non-TKG OVA vApp templates are being read as TKG OVA vApp templates by the UI plugin. The UI plugin v4.0.102 that I linked to above has that issue fixed, and it will be fixed in GA as well. Thank you!
Hi Niandrew, thanks for your help! After deleting the existing vApp templates and catalogs owned by the user org, the workflow works well!
Hello! I can help with the error you're seeing in the TKG OVA datagrid. Your CSE process seems to be running fine. I tried accessing your VCD testbed at 172.21.19.51, but it seems it's not accessible. There are a few things we can try here. Let me know if you'd like to schedule a Zoom call to go over this:
1. There is a known issue where non-TKG OVA vApp templates that are visible to the current user are read by the UI plugin as TKG OVAs. This can cause the error you're seeing. To quickly test this, you can either delete all non-TKG OVA vApp templates, or create a user in an org where only the TKG OVA vApp templates are visible (you can verify this with a Postman GET request to "https://{{host}}/api/query?type=vAppTemplate&format=records&page=1&pageSize=20&filterEncoded=true&sortAsc=name&links=true"; a curl equivalent is sketched below).
2. Alternatively, I have a test UI plugin build v4.0.102 (the beta build is v4.0.101) where this specific known issue is fixed: https://artifactory.eng.vmware.com/ui/native/cloud-director-solutions-generic-local/container-ui-plugin/4479576/. Can you download the zip file, go to provider VCD -> Customize Portal -> upload the zip file -> disable Container UI Plugin v4.0.101 -> enable and publish Container UI Plugin v4.0.102 -> refresh the browser -> try the workflow again to see if the error is gone?
Please let me know how it goes, or if you'd like to schedule a Zoom session.
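For reference, the same query from the command line (the bearer token header is an assumption; adjust to however you authenticate against VCD):

# List the vApp templates visible to the current user
curl -sk -H "Authorization: Bearer $VCD_TOKEN" \
  -H "Accept: application/*+json;version=36.0" \
  "https://{{host}}/api/query?type=vAppTemplate&format=records&page=1&pageSize=20&filterEncoded=true&sortAsc=name&links=true"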