Re: vSphere with Tanzu | TKGs Cluster | Volumes Fail to Mount

CrossBound · ‎02-09-2023

I have a new TKGs cluster running under vSphere with Tanzu. We are trying to deploy our workloads to it, but some volumes fail to mount. The PVC, PV, and VolumeAttachment are created, but the volume never mounts into the pod. Oddly, some volumes work, so the failure is not consistent. We have 3 workers. The working volume attachments all appear to be on the same worker node, and the failing ones are on the other 2 nodes. I'm guessing this means something is wrong with those nodes, but I don't see any obvious issues.

If I describe one of the volume attachments that is failing to mount, I see the following error:

rpc error: code = Internal desc = Watch on virtualmachine "leaf-primary-prod-workers-c7jtr-5c6b5488b9-2rb85" timed out
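One way to check whether the failures really line up with particular workers is to tally the VolumeAttachments by node and attach state. This is only a minimal sketch: it assumes the default column layout of `kubectl get volumeattachment` (NAME, ATTACHER, PV, NODE, ATTACHED, AGE), and the function name `summarize_attachments` is made up for illustration.

```shell
# Hypothetical helper: tally VolumeAttachments per node and attach state.
# Reads the table printed by `kubectl get volumeattachment` on stdin and
# assumes the default columns: NAME ATTACHER PV NODE ATTACHED AGE.
summarize_attachments() {
  awk 'NR > 1 { count[$4 " attached=" $5]++ }
       END { for (key in count) print key, count[key] }' | sort
}

# Usage, run against the guest (TKGs) cluster context:
#   kubectl get volumeattachment | summarize_attachments
```

If every `attached=false` row belongs to the same two workers, that supports the suspicion that the problem is tied to those nodes rather than to individual volumes.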

I've found some articles related to errors with mounting, but they seem to be related to when a node gets deleted and then the pod fails to come up on a new node. This does not apply to us, since we have not deleted any nodes or had any node failures, etc. that I'm aware of.

Does anyone have an idea what might be causing the volumes to fail to mount?

EDIT: I tried draining one of the nodes and rebooting it, and then moving workloads back to it. They still failed. I also drained both nodes where volumes were failing to mount, and all of the volumes successfully mounted on the remaining node (the one that worked the whole time). It definitely seems like there is an issue with 2 of the nodes and 1 of the nodes is working. I just can't figure out what the problem is.

EDIT: I see the following in the kubelet logs on one of the failing nodes:

Feb 09 22:01:16 leaf-primary-prod-workers-c7jtr-5c6b5488b9-2rb85 kubelet[469]: E0209 22:01:16.211905     469 kubelet.go:1751] "Unable to attach or mount volumes for pod; skipping pod" err="unmounted volumes=[nats-jwt-pvc nats-js-pvc], unattached volumes=[pid nats-jwt-pvc nats-js-pvc kube-api-access-qq7hr config-volume]: timed out waiting for the condition" pod="nats-next/nats-1"
Feb 09 22:01:16 leaf-primary-prod-workers-c7jtr-5c6b5488b9-2rb85 kubelet[469]: E0209 22:01:16.211948     469 pod_workers.go:951] "Error syncing pod, skipping" err="unmounted volumes=[nats-jwt-pvc nats-js-pvc], unattached volumes=[pid nats-jwt-pvc nats-js-pvc kube-api-access-qq7hr config-volume]: timed out waiting for the condition" pod="nats-next/nats-1" podUID=bf51e525-a9c7-49bb-93e3-47b67c0750c3
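Since the kubelet repeats these messages on every sync, it can help to reduce them to one line per pod. The sketch below assumes the log lines keep the exact `unmounted volumes=[...]` and `pod="..."` fields shown above and that kubelet logs are read from the systemd journal on the node; the function name `unmounted_by_pod` is hypothetical.

```shell
# Hypothetical helper: list each pod once with its unmounted volumes.
# Reads kubelet log lines on stdin and keeps only those that carry both
# an "unmounted volumes=[...]" list and a pod="namespace/name" field.
unmounted_by_pod() {
  sed -n 's/.*unmounted volumes=\[\([^]]*\)\].* pod="\([^"]*\)".*/\2: \1/p' | sort -u
}

# Usage on a worker node (assumes kubelet runs as a systemd unit):
#   journalctl -u kubelet --no-pager | unmounted_by_pod
```

For the two log lines above this collapses to a single entry for `nats-next/nats-1` listing `nats-jwt-pvc nats-js-pvc`.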

EDIT: I've noticed when the problem occurs that in the vSphere UI it shows a couple of tasks that are getting repeated over and over indefinitely. One task is for "Attach container volume" and it completes successfully, but the second task for "Attach a virtual disk" fails each time with the error below.

Database temporarily unavailable or has network problems.

I've also created a completely new namespace and cluster, and the same problem seems to be presenting on the newer cluster as well.

Juan-Herrera · ‎03-21-2023

Hello,

Just to understand your environment, do you use vSphere with Tanzu with Avi or HAProxy?

CrossBound · ‎03-21-2023

We are using NSX ALB (Avi)

Juan-Herrera · ‎03-21-2023

Did you check whether the CSI service is running OK on Avi? Also, please log in to your supervisor cluster and check that the CSI pods are running OK.

CrossBound · ‎03-21-2023

> please login to your supervisor cluster and check that csi pods are running ok

I don't know how to do that.

Juan-Herrera · ‎03-21-2023

To log in to your supervisor cluster you will need an administrator account. Please follow this process: https://docs.vmware.com/en/VMware-vSphere/7.0/vmware-vsphere-with-tanzu/GUID-F5114388-1838-4B3B-8A8D...

Then when you are connected to your supervisor cluster, check that you are in the correct context:

kubectl config get-contexts (my supervisor is called tkg-management)

To change the context you have to use this command:

kubectl config use-context tkg-management-admin@tkg-management (this is the name of my supervisor context)

After selecting the correct context, run this command:

kubectl get pods -A

You should see vsphere-csi pods running.
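On a busy cluster the full pod list is long, so it may be easier to print only the CSI pods that look unhealthy. This is a rough sketch, assuming the default columns of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS, ...) and that the CSI pods have "csi" somewhere in their name; the function name `csi_pods_not_running` is hypothetical.

```shell
# Hypothetical helper: print CSI pods that are not Running or not fully ready.
# Reads the table printed by `kubectl get pods -A` on stdin and assumes the
# default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
csi_pods_not_running() {
  awk 'NR > 1 && tolower($2) ~ /csi/ {
         split($3, ready, "/")   # READY looks like "4/6"
         if ($4 != "Running" || ready[1] != ready[2]) print $1 "/" $2, $3, $4
       }'
}

# Usage, from the supervisor cluster context:
#   kubectl get pods -A | csi_pods_not_running
```

Empty output means every pod with "csi" in its name reports Running with all containers ready; anything printed is a candidate for `kubectl describe pod` and `kubectl logs`.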

If all services are OK, please review this post; you may have nodes that no longer exist in your cluster but are still trying to mount volumes. https://veducate.co.uk/kubelet-unable-attach-volumes/
