CSE Service Observation and TKG cluster deployment stuck on reconciling state

vBahubali01 · ‎07-29-2022

Hello,

I made some progress with the CSE OVA deployment on my management VLAN. I verified the CSE service status and it is active. The first thing I noticed is that the system admin service account password shows up in the service status output, which is not good:

● cse.service - Cloud Director Container Service Extension
Loaded: loaded (/etc/systemd/system/cse.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2022-07-29 09:00:51 UTC; 44min ago
Main PID: 1491 (bash)
Tasks: 7 (limit: 4693)
Memory: 26.4M

CGroup: /system.slice/cse.service
├─1491 /bin/bash /root/cse.sh
└─1498 /root/vkp -u svc-cse-vcd -p XXXXXXXX -o system -s https://cloud.lab.infra

Second, I tried creating a new cluster. As part of the deployment process, a virtual machine named EPHEMERAL_TEMP_VM was created, and after that I could not see any progress. In the CSE extension, the cluster shows in a reconciling state. Is it possible to check the logs to see what stage the cluster deployment has reached?

vBahubali01 · ‎07-29-2022

Just to give feedback regarding the cluster deployment, I am seeing the behavior below.

EPHEMERAL_TEMP_VM is created, and in its events the last event name is InfraVappAvailable.

After some time, EPHEMERAL_TEMP_VM got deleted on its own.
Then a new EPHEMERAL_TEMP_VM got deployed without any action from my side.
I could see a Virtual Service got created on the edge gateway, but it was down.
I could see a pool also got created, without any members.

I logged into EPHEMERAL_TEMP_VM and could see that all the needed files were downloaded as per the bootstrap.sh script, which confirms my connectivity to the internet. I verified that my Cloud Director portal endpoint also resolves.

There are no logs under the path /var/log/vcd-ke/customization on EPHEMERAL_TEMP_VM to give any clue as to why the deployment is not progressing. The Cloud Director version is 10.3.3.

sakthi2019 · ‎08-01-2022

Log in to the CSE OVA machine where the service is running and execute the following command to see the log statements:

journalctl -u cse
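
The full journal can be long, so it may help to narrow it to lines that mention the cluster being deployed. The following is a minimal sketch, not part of CSE: the helper name filter_cluster_logs is hypothetical, and it simply does a fixed-string match on whatever log text is piped into it.

```shell
# Hypothetical helper: keep only the log lines that mention a given cluster name.
# It reads log text on stdin, so it works with journalctl output or a saved file.
filter_cluster_logs() {
  # -F treats the name as a literal string, not a regular expression.
  grep -F -- "$1"
}

# Usage on the CSE appliance (shown as a comment, not run here):
#   journalctl -u cse --no-pager --since "1 hour ago" | filter_cluster_logs tkgcl01
```

The cluster name tkgcl01 in the usage comment is the one mentioned in this thread; substitute your own.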

sakthi2019 · ‎08-01-2022

After logging into EPHEMERAL_TEMP_VM, you can find detailed log messages recorded in /var/log/cloud-final.err.
This file records the list of commands and their statuses in order of execution. You can also use it to find out which command is stuck in an infinite loop or whether any error happened during cluster creation.

Currently, if cluster creation fails for any reason, the error handler in CSE removes the ephemeral VM and stamps the RDE in an error state. After 10 minutes, it gets picked up again for a retry of cluster creation.

Your feedback on the service command-line arguments exposing the password is appreciated. We will make the necessary updates in the upcoming release.

vBahubali01 · ‎08-01-2022

Hello,

I checked the log file cloud-final.err, and the last command I can see shows the machines in the Provisioning and Pending phases. Log file attached for reference.

++ kubectl get machines -n tkgcl01-ns -o 'jsonpath={.items[*].status.phase}'
+ [[ Provisioning Pending =~ ^(Running )*Running$ ]]
+ sleep 20
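
To make the trace above easier to read: the script appears to poll the machine phases and sleep until every phase is Running. The sketch below reproduces only that test, under the assumption that the phases arrive as one space-separated string as in the trace. The function name all_machines_running is hypothetical, and grep -E is used with the same extended regular expression in place of the bash [[ =~ ]] form so that it runs in any POSIX shell.

```shell
# Hypothetical helper mirroring the check in the trace: it succeeds only when
# the space-separated phase list consists solely of "Running" entries.
all_machines_running() {
  printf '%s\n' "$1" | grep -Eq '^(Running )*Running$'
}

# The phases from the trace do not match, so a loop built on this would keep sleeping.
all_machines_running "Provisioning Pending" || echo "still waiting"

# Once every machine reports Running, the check passes.
all_machines_running "Running Running" && echo "all machines running"
```

With the phases shown in the log (Provisioning Pending), the check fails, which is consistent with the script looping on sleep 20.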

vBahubali01 · ‎08-02-2022

Hello,

I am not sure whether the message I posted with logs is visible. It is getting marked as spam.

I checked the logs using journalctl -u cse and could see an [ENF] Entity not found message. I am not sure why it is trying to find a vApp named tkgcl01_ephemeral_vapp. I don't see any vApp with this name. The vApp name is tkgcl01, which is the name I used while creating the TKG cluster.

sakthi2019 · ‎08-02-2022

At this point, you may want to stop the CSE service so you can get to the ephemeral VM before it gets deleted and recreated as part of the retry.

If you are able to log in to the ephemeral VM successfully, please do the following.

1. export KUBECONFIG=/.kube/config
2. Run kubectl get pods -A to look for pods that are stuck or having issues
3. Run kubectl logs on the individual pods to see their logs
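
As a rough illustration of steps 2 and 3, the pod listing can be filtered down to the pods worth inspecting. This is only a sketch: the helper name list_unhealthy_pods is hypothetical, and it assumes the default column order of kubectl get pods -A (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Hypothetical helper: read `kubectl get pods -A --no-headers` output on stdin and
# print namespace/name plus status for pods that are neither Running nor Completed.
# Assumes the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
list_unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Usage on the ephemeral VM after setting KUBECONFIG (shown as comments, not run here):
#   kubectl get pods -A --no-headers | list_unhealthy_pods
#   kubectl logs -n <namespace> <pod-name>
```

Each line it prints names a pod to pass to kubectl logs with the matching -n namespace.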

In addition, if you can access: https://github.com/vmware/cloud-provider-for-cloud-director/blob/main/scripts/generate-k8s-log-bundl...

Please run the above script after setting KUBECONFIG on the ephemeral VM and upload the log bundle. We can then take a look and investigate further.

sakthi2019 · ‎08-02-2022

>I am not sure why its trying to find a vApp with the name tkgcl01_ephemeral_vapp.

We have a common function that is used both for the user-initiated Delete Cluster operation and for clean-up on error while creating the cluster.

As you mentioned, no tkgcl01_ephemeral_vapp is created in your use case. It is just an info-level message that can be ignored.
