jamesmcewan
Contributor

TKG Guest Cluster hung in "Creating" phase

We've recently set up a greenfield vSphere 7 environment as a PoC for running Tanzu with NSX-T (VDS) networking.

We're running the following versions for the different components:
VMware ESXi: 7.0.1, 17551050
vCenter: 7.0.1, 17491160
NSX-T: 3.1.0, 17107167

We successfully enabled Workload Management for the cluster, and verified the status of the supervisor cluster with "kubectl get nodes" and "kubectl get pods -A".

NAME                               STATUS   ROLES    AGE   VERSION
4221373f949eab149795e6e9a54fd7ed   Ready    master   22h   v1.18.2-6+38ac483e736488
42215d7c8ca156f1001ddab6381ed592   Ready    master   22h   v1.18.2-6+38ac483e736488
42218787e2e8b830f2bc12348f7d749a   Ready    master   22h   v1.18.2-6+38ac483e736488
esx1             Ready    agent    21h   v1.18.2-sph-83e7e60
esx2             Ready    agent    21h   v1.18.2-sph-83e7e60
esx3             Ready    agent    21h   v1.18.2-sph-83e7e60

Our problem is that our first deployment of a TKG guest cluster has been stuck in the "Creating" phase since yesterday afternoon:

[Screenshot: jamesmcewan_0-1615535871739.png]
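For anyone hitting the same symptom: the stuck cluster's state can also be inspected from the supervisor cluster context. A sketch — the namespace is taken from the logs further down, but the cluster name `james-demo-k8s` is an assumption:

```shell
# Sketch: inspect the TanzuKubernetesCluster and its VMs from the
# supervisor cluster. The namespace 'james-k8s-ej7po' comes from the
# controller logs; the cluster name 'james-demo-k8s' is assumed.
kubectl get tanzukubernetescluster -n james-k8s-ej7po
kubectl describe tanzukubernetescluster james-demo-k8s -n james-k8s-ej7po
kubectl get virtualmachines -n james-k8s-ej7po
```

The `describe` output's conditions and events usually narrow down which component (VM deployment, networking, bootstrap) is blocking the "Creating" phase.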

From the vCenter UI, I can see that the first control plane VM has been deployed, but it has not been powered on and is not attached to the correct network.

If I dig into the logs for the vmware-system-vmop controller pod, I can see that the issue appears to be with attaching the control plane VM to the correct network:

E0312 07:58:46.135129       1 network_provider.go:567] vsphere "msg"="Failed to search for nsx-t network associated with vnetif" "error"="opaque network with ID 'afd4d896-5590-4384-bb85-c742503006e6' not found"  "vnetif"={"metadata":{"name":"james-demo-k8s-dbwv75l2mz0mazwe1qft7m52q-vnet-james-demo-k8s-dbwv75l2mz0mazwe1qft7m52q-control-plane-jhkn8-lsp","namespace":"james-k8s-ej7po","selfLink":"/apis/vmware.com/v1alpha1/namespaces/james-k8s-ej7po/virtualnetworkinterfaces/james-demo-k8s-dbwv75l2mz0mazwe1qft7m52q-vnet-james-demo-k8s-dbwv75l2mz0mazwe1qft7m52q-control-plane-jhkn8-lsp","uid":"37ec4ecc-f86c-4085-ae3b-10671c70cb4a","resourceVersion":"171030","generation":2,"creationTimestamp":"2021-03-11T14:36:54Z","ownerReferences":[{"apiVersion":"vmoperator.vmware.com/v1alpha1","kind":"VirtualMachine","name":"james-demo-k8s-dbwv75l2mz0mazwe1qft7m52q-control-plane-jhkn8","uid":"791d2536-0ea9-4bcd-a173-59549e3eb450"}],"managedFields":[{"manager":"manager","operation":"Update","apiVersion":"vmware.com/v1alpha1","time":"2021-03-11T14:36:54Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:ownerReferences":{".":{},"k:{\"uid\":\"791d2536-0ea9-4bcd-a173-59549e3eb450\"}":{".":{},"f:apiVersion":{},"f:kind":{},"f:name":{},"f:uid":{}}}},"f:spec":{".":{},"f:virtualNetwork":{}},"f:status":{}}},{"manager":"nsx-ncp-68bff46dcf-q6w45","operation":"Update","apiVersion":"vmware.com/v1alpha1","time":"2021-03-11T14:36:55Z","fieldsType":"FieldsV1","fieldsV1":{"f:status":{"f:conditions":{},"f:interfaceID":{},"f:ipAddresses":{},"f:macAddress":{},"f:providerStatus":{".":{},"f:nsxLogicalSwitchID":{}}}}}]},"spec":{"virtualNetwork":"james-demo-k8s-dbwv75l2mz0mazwe1qft7m52q-vnet"},"status":{"conditions":[{"status":"True","type":"Ready"}],"interfaceID":"0ecf7929-230b-40ac-a5c5-e454d23406bd","ipAddresses":[{"gateway":"10.5.64.33","ip":"10.5.64.34","subnetMask":"255.255.255.240"}],"macAddress":"04:50:56:00:78:03","providerStatus":{"nsxLogicalSwitchID":"afd4d896-5590-4384-bb85-c742503006e6"}}}
E0312 07:58:46.135289       1 virtualmachine_controller.go:408] controllers/VirtualMachine "msg"="Provider failed to update VirtualMachine" "error"="failed to create vnic '{nsx-t james-demo-k8s-dbwv75l2mz0mazwe1qft7m52q-vnet \u003cnil\u003e }': opaque network with ID 'afd4d896-5590-4384-bb85-c742503006e6' not found"  "name"="james-k8s-ej7po/james-demo-k8s-dbwv75l2mz0mazwe1qft7m52q-control-plane-jhkn8"
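For reference, logs like the above can be pulled from the supervisor cluster along these lines (the exact pod name differs per deployment, so look it up first):

```shell
# Sketch: locate the VM Operator controller pod on the supervisor
# cluster, then filter its logs for the network-provider errors.
# The pod name below is a placeholder.
kubectl get pods -n vmware-system-vmop
kubectl logs -n vmware-system-vmop <vmop-controller-pod> --all-containers \
  | grep -i 'network_provider'
```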

However, the logical segment with ID 'afd4d896-5590-4384-bb85-c742503006e6' does exist in both NSX-T and vCenter, so I don't understand why the virtualmachine_controller can't find it.
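One way to cross-check what vCenter itself exposes is govc (assuming it is installed; the connection variables below are placeholders):

```shell
# Sketch: list every network object vCenter exposes and check how the
# NSX segment surfaces. All connection details here are placeholders.
export GOVC_URL='vcenter.example.com' \
       GOVC_USERNAME='administrator@vsphere.local' \
       GOVC_PASSWORD='...' \
       GOVC_INSECURE=1
govc find / -type n
# Note: with NSX-T on VDS, segments typically surface in vCenter as NSX
# distributed port groups rather than opaque networks, which may be why
# a lookup for an *opaque network* by this ID comes back empty.
```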

Here is a call to the NSX-T API confirming the existence of the logical segment with unique_id 'afd4d896-5590-4384-bb85-c742503006e6':

{
  "type": "ROUTED",
  "subnets": [
    {
      "gateway_address": "10.5.64.33/28",
      "network": "10.5.64.32/28"
    }
  ],
  "connectivity_path": "/infra/tier-1s/t1_41fb8287-ab8a-4368-9cca-2e652a8cc17f_rtr",
  "transport_zone_path": "/infra/sites/default/enforcement-points/default/transport-zones/c5b365e5-8ccd-4559-a822-48ec6745e9dd",
  "advanced_config": {
    "address_pool_paths": [
      "/infra/ip-pools/vnet_ba15c213-411e-4822-ad2f-31ba0241586e_0"
    ],
    "hybrid": false,
    "inter_router": false,
    "local_egress": false,
    "urpf_mode": "STRICT",
    "connectivity": "ON"
  },
  "admin_state": "UP",
  "replication_mode": "MTEP",
  "resource_type": "Segment",
  "id": "vnet_ba15c213-411e-4822-ad2f-31ba0241586e_0",
  "display_name": "vnet-domain-c2035:50a8ecea-4938-4de5-964d-5b9b796ea787-james-k8s-ej7po--fa589-0",
  "tags": [
    { "scope": "ncp/version", "tag": "1.2.0" },
    { "scope": "ncp/cluster", "tag": "domain-c2035:50a8ecea-4938-4de5-964d-5b9b796ea787" },
    { "scope": "ncp/vnet", "tag": "james-demo-k8s-dbwv75l2mz0mazwe1qft7m52q-vnet" },
    { "scope": "ncp/vnet_uid", "tag": "ba15c213-411e-4822-ad2f-31ba0241586e" },
    { "scope": "ncp/created_for", "tag": "vif_network" },
    { "scope": "ncp/project", "tag": "james-k8s-ej7po" },
    { "scope": "ncp/project_uid", "tag": "41fb8287-ab8a-4368-9cca-2e652a8cc17f" }
  ],
  "path": "/infra/segments/vnet_ba15c213-411e-4822-ad2f-31ba0241586e_0",
  "relative_path": "vnet_ba15c213-411e-4822-ad2f-31ba0241586e_0",
  "parent_path": "/infra",
  "unique_id": "afd4d896-5590-4384-bb85-c742503006e6",
  "marked_for_delete": false,
  "overridden": false,
  "_create_user": "wcp-cluster-user-domain-c2035-c740098a-5bb1-4750-9baa-25e40f8de1ae",
  "_create_time": 1615473342333,
  "_last_modified_user": "wcp-cluster-user-domain-c2035-c740098a-5bb1-4750-9baa-25e40f8de1ae",
  "_last_modified_time": 1615473342343,
  "_system_owned": false,
  "_protection": "REQUIRE_OVERRIDE",
  "_revision": 0
}
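The payload above can be fetched via the NSX-T Policy API with a request along these lines — the segment ID matches the `id` field above, while the manager address and credentials are placeholders:

```shell
# Sketch: query the segment via the NSX-T Policy API.
# Manager hostname and credentials below are placeholders.
curl -sk -u 'admin:<password>' \
  'https://nsx-manager.example.com/policy/api/v1/infra/segments/vnet_ba15c213-411e-4822-ad2f-31ba0241586e_0'
```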

And a screenshot showing existence in vCenter:

[Screenshot: jamesmcewan_1-1615537067144.png]

Any ideas as to what the root cause of the issue may be, or how I can troubleshoot further?

Thanks,
James.

microlytix
Enthusiast

Hi @jamesmcewan 

Workload Management can be tricky indeed. 🙂

If one of the deployment parameters isn't correct, you either fail to enable Workload Management or run into trouble later.

I had problems enabling Workload Management (an infinite loop) which almost drove me nuts.

I was glad to find a blog post by @jasonboche which helped me find my configuration bug.

* check your TEP network with large frames (1572 bytes): vmk10 must be able to send large frames to every other host TEP and also the edge TEPs

* check that the ingress and egress CIDRs are part of your Edge uplink network (that was my mistake)

* check the connection (trust) from NSX-T to vCenter
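The TEP check in particular can be run from an ESXi shell — a sketch, with the remote TEP IP as a placeholder:

```shell
# Sketch: send a don't-fragment (-d) ping of 1572 bytes (-s) from the
# host's TEP vmkernel interface (vmk10) over the NSX TCP/IP stack to a
# remote host or edge TEP. Replace the IP with a real TEP address.
vmkping ++netstack=vxlan -I vmk10 -d -s 1572 192.0.2.10
```

If this fails while smaller pings succeed, the underlying physical network's MTU is too small for the overlay traffic.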

 

Even if the SupervisorControlPlane VMs were successfully deployed, you might still get into trouble with TKG deployments.

I can really recommend Jason's post and his troubleshooting steps.

Kind regards

Michael

blog: https://www.elasticsky.de/en