I have a new TKGs cluster running under vSphere with Tanzu. In the cluster, we are attempting to deploy our workloads, but some volumes are failing to mount. The PVC, PV, and VolumeAttachment are getting created, but the volume fails to mount into the pod. Oddly, some are working, so it's not a consistent issue. We have 3 workers. It does appear that the working volume attachments are all on the same worker node, and the volume attachments that are failing are on the other 2 nodes. I'm guessing this means there is something wrong with those nodes, but I don't see any obvious issues.
If I describe one of the volume attachments that is failing to mount, I see the following error:
rpc error: code = Internal desc = Watch on virtualmachine "leaf-primary-prod-workers-c7jtr-5c6b5488b9-2rb85" timed out
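For anyone wanting to confirm the same per-node pattern, here is a rough sketch that counts VolumeAttachments per node, split by whether they report as attached. The helper name summarize_attachments is made up for illustration; the kubectl call in the comment assumes you are in the workload cluster context and relies only on the standard VolumeAttachment fields .spec.nodeName and .status.attached.

```shell
# Hypothetical helper: count VolumeAttachments per node and attached state.
# Expects lines of "<node> <attached>" on stdin.
summarize_attachments() {
  awk '{ count[$1 " attached=" $2]++ }
       END { for (k in count) print count[k], k }' | sort -k2
}

# Usage against a live cluster (assumes the workload cluster context):
# kubectl get volumeattachment --no-headers \
#   -o custom-columns=NODE:.spec.nodeName,ATTACHED:.status.attached \
#   | summarize_attachments
```

If the failures really are tied to specific nodes, the attached=false rows should only list those nodes.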
I've found some articles related to errors with mounting, but they seem to be related to when a node gets deleted and then the pod fails to come up on a new node. This does not apply to us, since we have not deleted any nodes or had any node failures, etc. that I'm aware of.
Does anyone have an idea what might be causing the volumes to fail to mount?
EDIT: I tried draining one of the nodes and rebooting it, and then moving workloads back to it. They still failed. I also drained both nodes where volumes were failing to mount, and all of the volumes successfully mounted on the remaining node (the one that worked the whole time). It definitely seems like there is an issue with 2 of the nodes and 1 of the nodes is working. I just can't figure out what the problem is.
EDIT: I see the following in the kubelet logs on one of the failing nodes:
Feb 09 22:01:16 leaf-primary-prod-workers-c7jtr-5c6b5488b9-2rb85 kubelet: E0209 22:01:16.211905 469 kubelet.go:1751] "Unable to attach or mount volumes for pod; skipping pod" err="unmounted volumes=[nats-jwt-pvc nats-js-pvc], unattached volumes=[pid nats-jwt-pvc nats-js-pvc kube-api-access-qq7hr config-volume]: timed out waiting for the condition" pod="nats-next/nats-1"
Feb 09 22:01:16 leaf-primary-prod-workers-c7jtr-5c6b5488b9-2rb85 kubelet: E0209 22:01:16.211948 469 pod_workers.go:951] "Error syncing pod, skipping" err="unmounted volumes=[nats-jwt-pvc nats-js-pvc], unattached volumes=[pid nats-jwt-pvc nats-js-pvc kube-api-access-qq7hr config-volume]: timed out waiting for the condition" pod="nats-next/nats-1" podUID=bf51e525-a9c7-49bb-93e3-47b67c0750c3
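To see at a glance which pods and volumes are affected on a node, the kubelet messages above can be boiled down to one line per pod. This is only a sketch: failed_mounts is a made-up helper name, it assumes the log format matches the lines quoted above, and the journalctl call assumes kubelet runs as a systemd unit named kubelet on the node.

```shell
# Hypothetical helper: reduce kubelet "Unable to attach or mount volumes"
# errors to "<namespace>/<pod>: <unmounted volumes>", de-duplicated.
failed_mounts() {
  grep 'Unable to attach or mount volumes' |
    sed -n 's/.*unmounted volumes=\[\([^]]*\)].*pod="\([^"]*\)".*/\2: \1/p' |
    sort -u
}

# Usage on a worker node (assumes a systemd unit named kubelet):
# journalctl -u kubelet --no-pager | failed_mounts
```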
EDIT: I've noticed when the problem occurs that in the vSphere UI it shows a couple of tasks that are getting repeated over and over indefinitely. One task is for "Attach container volume" and it completes successfully, but the second task for "Attach a virtual disk" fails each time with the error below.
Database temporarily unavailable or has network problems.
I've also created a completely new namespace and cluster, and the same problem seems to be presenting on the newer cluster as well.
To log in to your Supervisor Cluster you will need an administrator account. Please follow this process: https://docs.vmware.com/en/VMware-vSphere/7.0/vmware-vsphere-with-tanzu/GUID-F5114388-1838-4B3B-8A8D...
Then when you are connected to your supervisor cluster, check that you are in the correct context:
kubectl config get-contexts (my supervisor is called tkg-management)
To change the context you have to use this command:
kubectl config use-context tkg-management-admin@tkg-management (this is the name of my supervisor context)
After selecting the correct context, run this command:
kubectl get pods -A
You should see vsphere-csi pods running.
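If the pod list is long, a small filter can make unhealthy CSI pods stand out. This is a sketch only: csi_pods_not_running is a made-up helper name, it simply matches any pod whose name contains "csi", and it assumes the default column layout of kubectl get pods -A --no-headers (namespace, name, ready, status, ...).

```shell
# Hypothetical helper: print CSI pods whose STATUS column is not "Running".
# Expects the default "kubectl get pods -A --no-headers" layout on stdin.
csi_pods_not_running() {
  awk 'tolower($2) ~ /csi/ && $4 != "Running" { print $1 "/" $2, $4 }'
}

# Usage (assumes the supervisor context is selected):
# kubectl get pods -A --no-headers | csi_pods_not_running
```

No output means every CSI pod it found reports Running; anything printed is worth a closer look with kubectl describe pod and kubectl logs.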
If all services are OK, please review this post; you may have nodes that no longer exist in your cluster but are still trying to mount volumes. https://veducate.co.uk/kubelet-unable-attach-volumes/