VMware Modern Apps Community
amayacittagill
Contributor

TKGS Antrea

Hey all, 

I've been mucking about with K8s for a while using manual Debian builds, and given where my company aligns itself, I thought I should learn Tanzu. For now I have it running with DVS networking and the AVI load balancer. When I come to deploy the actual TKGS cluster I get intermittent results: after spending a while looking at logs, it appears the Antrea agent doesn't always install correctly. If I re-deploy nodes they sometimes come up good, sometimes not. Without the agent there is no CNI and the node is next to useless.

The error I'm trying to track down now is the following, taken from the output of a failed agent pod on a busted node.

kubectl logs antrea-agent-5sjml -n kube-system
error: a container name must be specified for pod antrea-agent-5sjml, choose one of: [antrea-agent antrea-ovs] or one of the init containers: [install-cni]

If I SSH into the node there is no Antrea setup at all, just the default eth network interfaces: no ovs-system interface or anything else beyond the defaults. Pods won't run on these nodes, failing with errors such as:

NetworkPlugin cni failed to set up pod "pod_name" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial unix /var/run/antrea/cni.sock: connect: no such file or directory
There is discussion about this on GitHub here https://github.com/antrea-io/antrea/issues/832 - the suggestion there is that it comes down to the order in which things start up. But reboots don't help; after lots of rebuilds it eventually works (with the same YAML specification). The more generic error is "FailedCreatePodSandBox", which appears before the message above.
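
For anyone following along, these are the quick checks I ran on a busted node (paths assume the standard Antrea install locations and labels, so adjust if yours differ):

# on the node: has the Antrea CNI binary / socket ever been laid down?
ls -l /opt/cni/bin/ | grep -i antrea
ls -l /var/run/antrea/

# from kubectl: state of the agent DaemonSet pods
kubectl -n kube-system get pods -l component=antrea-agent -o wide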
 
I'm certain this is caused by Antrea not installing properly; the thing is, I can't fix it. I tried re-installing Antrea the way I do on my non-Tanzu Debian nodes, but it doesn't help. The TKGS file is consistent, so I think there is a bug in the build process.
 
If anyone has come across this, any pointers would be great.
 
We are running:
vSphere 7.0.3 18778458
KR v1.21.6---vmware.1-tkg.1.b3d708a
6 Replies
CHogan
VMware Employee

So the response you are getting from the kubectl logs command simply means that the Pod has multiple containers, and you need to specify one of the containers listed, either antrea-agent or antrea-ovs.

Therefore, the command you need to run will be something like:

kubectl logs antrea-agent-5sjml -n kube-system antrea-agent
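
If you are not sure which containers a Pod has, you can also list them directly rather than reading them out of the error message, e.g.:

kubectl get pod antrea-agent-5sjml -n kube-system -o jsonpath='{.spec.containers[*].name}'
kubectl get pod antrea-agent-5sjml -n kube-system -o jsonpath='{.spec.initContainers[*].name}'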

Hopefully that will help to narrow it down a bit. However I would say that, in my experience, the CNI not coming online is usually due to some other condition that occurred earlier. I would see if there is anything useful in the kubelet status or the cloud-init logs on the node.
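
Something along these lines once you have SSH'd to the node (the exact log locations can vary with the node image):

systemctl status kubelet
journalctl -u kubelet --no-pager -n 100
less /var/log/cloud-init-output.log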

http://cormachogan.com
amayacittagill
Contributor

Ah yes that makes sense on the log front. I forgot about pods with more than one container 🙂

Ok, so I now know what's causing it, but I don't understand why it's not working consistently, given I'm using the same YAML for the cluster each time I try.

The log from the Antrea agent gives the clue:

kubectl logs -n kube-system antrea-agent-5sjml -c antrea-agent
I0121 22:32:03.495710 1 log_file.go:99] Set log file max size to 104857600
I0121 22:32:03.496606 1 agent.go:66] Starting Antrea agent (version v0.13.5-2d26d15)
I0121 22:32:03.496703 1 client.go:34] No kubeconfig file was specified. Falling back to in-cluster config
I0121 22:32:03.499056 1 prometheus.go:151] Initializing prometheus metrics
I0121 22:32:03.499336 1 ovs_client.go:67] Connecting to OVSDB at address /var/run/openvswitch/db.sock
I0121 22:32:03.499604 1 agent.go:205] Setting up node network
I0121 22:32:03.537826 1 agent.go:656] Setting Node MTU=1450
E0121 22:32:03.537908 1 agent.go:690] Spec.PodCIDR is empty for Node tkgs-v2-production-worker-nodepool-01-26w5h-869b556678-qw8km. Please make sure --allocate-node-cidrs is enabled for kube-controller-manager and --cluster-cidr specifies a sufficient CIDR range
F0121 22:32:03.538402 1 main.go:58] Error running agent: error initializing agent: CIDR string is empty for node tkgs-v2-production-worker-nodepool-01-26w5h-869b556678-qw8km

If I then compare a working node to a bust one, I see the PodCIDR is indeed missing.

Working Node

kubectl get no tkgs-v2-production-worker-nodepool-01-26w5h-869b556678-jbbdq -o yaml | grep spec -C 3
name: tkgs-v2-production-worker-nodepool-01-26w5h-869b556678-jbbdq
resourceVersion: "360045"
uid: ea3d257b-2814-48bd-b11b-b5eff9b8061e
spec:
podCIDR: 10.97.3.0/24
podCIDRs:
- 10.97.3.0/24

Bust Node
kubectl get no tkgs-v2-production-worker-nodepool-01-26w5h-869b556678-6d6st -o yaml | grep spec -C 3
name: tkgs-v2-production-worker-nodepool-01-26w5h-869b556678-6d6st
resourceVersion: "360490"
uid: 82a56889-d831-41f0-9052-4df6b023aac9
spec:
providerID: vsphere://421f50a6-69ef-7655-88aa-33995a73103a
status:
addresses

All of the bust nodes are missing the PodCIDR; the thing is, all nodes use the same CIDR block as part of the YAML spec. I know with kubeadm init you pass this with --pod-network-cidr, but these nodes are built from YAML by the Supervisor cluster. The YAML has the entry... it's just not always being picked up properly.

      pods:
        cidrBlocks: ["10.97.2.0/23"]
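
For anyone else chasing this, a quick way to see which nodes actually got a PodCIDR assigned (plain kubectl, nothing TKGS-specific):

kubectl get nodes -o custom-columns=NAME:.metadata.name,PODCIDR:.spec.podCIDR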
 
Whilst mucking around I built a separate TKG management and workload cluster using tanzu management-cluster create - away from the Supervisor cluster - and I have no issues with the CNI initialising properly that way.
 
This is only an (intermittent) issue when deploying via vSphere with Tanzu through the Supervisor cluster with "kubectl apply -f prod-v2-cluster-config.yaml".
 
I might try reverting to a different release; I'm using v1.21.6---vmware.1-tkg.1.b3d708a with the Supervisor cluster, whereas the TKG template image is v1.21.2. I'll report back.
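
If I do go down that route, I believe the available Kubernetes releases can be listed from the Supervisor cluster context with something like this (resource name from memory, so double-check):

kubectl get tanzukubernetesreleases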
amayacittagill
Contributor

Ok, so it turns out to be a very simple cause: each node consumes a /24 from your PodCIDR allocation. I had mine at only a /23, so there was only room for one master and one worker before it was exhausted. I thought the failures were intermittent, but they were actually entirely consistent; I was just ignorant of this behaviour.

I've now set it to a /20, which allows 16 nodes: 3 masters and 13 workers, which is plenty. Hurrah.
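
For reference, the pods block in my cluster YAML now looks roughly like this (same structure as the snippet in my earlier reply; a /20 gives 16 of the per-node /24s):

      pods:
        cidrBlocks: ["10.97.0.0/20"]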

Thanks for your pointers Cormac, you rock. Also thanks for replying to the PM 🙂

Here is the output confirming the success... everything works fine now.

$ kubectl get no -o yaml | grep CIDR
podCIDR: 10.97.0.0/24
podCIDRs:
podCIDR: 10.97.1.0/24
podCIDRs:
podCIDR: 10.97.2.0/24
podCIDRs:
podCIDR: 10.97.4.0/24
podCIDRs:
podCIDR: 10.97.3.0/24
podCIDRs:

amayacittagill
Contributor

It might be worth adding the /24-per-node consumption to the documentation; it's not obvious. Even the Kubernetes docs aren't clear on it to me: they imply nodes consume part of the CIDR but don't specifically say a /24 each. The GCP docs do flesh it out a bit more.

CHogan
VMware Employee

So the only place I see this specified is in the minimum network requirements for HAProxy - https://docs.vmware.com/en/VMware-vSphere/7.0/vmware-vsphere-with-tanzu/GUID-C3048E95-6E9D-4AC3-BE96... - where it states that the Kubernetes services CIDR must be a /16. We don't seem to highlight this in the NSX ALB section, though. Let me raise it with the docs team and see if we can make it more visible. Good luck with the rest of the testing.

 

http://cormachogan.com
amayacittagill
Contributor

Cool, thanks. I never read that bit as I wasn't using HAProxy. As well as saying it needs a /16, I think it's worth saying each node consumes a /24 by default. Nice and clear then 🙂
