elihuj
Enthusiast

Issues Enabling Workload Management with vSphere 7

I am attempting to set up Workload Management in a greenfield vSphere 7 environment with NSX-T, and it keeps hanging at "Error configurating cluster NIC on master VM. This operation is part of API server configuration and will be retried". I see the following in the wcpsvc.log file:

2020-09-08T16:16:54.416Z error wcp [opID=5f57bd08-domain-c8] Failed to create cluster network interface for MasterNode: VirtualMachine:vm-88. Err: Unauthorized

2020-09-08T16:16:54.416Z error wcp [opID=5f57bd08-domain-c8] Error configuring cluster NIC on master VM vm-88: Unauthorized

2020-09-08T16:16:54.416Z error wcp [opID=5f57bd08-domain-c8] Error configuring API server on cluster domain-c8 Error configuring cluster NIC on master VM. This operation is part of API server configuration and will be retried.
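For anyone who wants to pull the same kind of entries out of a busy log, here is a minimal sketch. It is not part of the original troubleshooting: the helper name wcp_errors is hypothetical, and it assumes the line layout seen in the excerpt above (timestamp, then "error wcp", then the opID).

```shell
# Hypothetical helper: print error-level wcp lines that mention one cluster ID.
# Assumes the "<timestamp> error wcp [opID=...]" layout from the excerpt above.
# Reads log text on stdin. The cluster ID is matched as a plain substring,
# so "domain-c8" would also match "domain-c88".
wcp_errors() {
    grep -F ' error wcp ' | grep -F -- "$1"
}
```

Usage would look like `wcp_errors domain-c8 < /var/log/vmware/wcp/wcpsvc.log`. The log path is an assumption about the vCenter appliance layout, so check it on your build.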

My vCenter and NSX deployments are on the same Layer 2 segment. NSX-T is currently functioning, with connectivity validated from a logical segment out to the Internet. I have also validated that the MTU is 1600 throughout the environment.

15 Replies
daphnissov
Immortal

Are your hosts also running ESXi 7?

elihuj
Enthusiast

Yes, ESXi 7 build 16324942.

VirtualizingStu
Enthusiast

Hi elihuj,

Make sure the Edge nodes are deployed as at least Medium (I suggest Large if you have the resources available), because the load balancer that gets deployed is medium-sized.

elihuj
Enthusiast

Hello VirtualizingStuff, thank you for the reply. I did deploy a Large Edge, but unfortunately that was not the fix. I tried it again, and for whatever reason it succeeded all the way through.

laszlo_laszlo
Contributor

Hello,

I have the same issue. Versions: NSX-T 3.1, VMware ESXi 7.0.1 build 17168206, vCenter build 17004997.

In the NSX-T Manager alarms there is one open issue while Workload Management hangs. I'm using three NSX Manager appliances.

Manager Node has detected the NCP is down or unhealthy.

Entity name: domain-c11:a83fdad6-c5e1-472e-a47b-d670fb2dd1c3

I noticed that this entity does not exist. I'm very new to NSX-T, so I don't know whether this error is relevant.

The transport node and Edge node tunnels look fine, if I'm reading them right.

Attachments: nsxt-01.PNG, nsxt-02.PNG

Please advise where I should look for the root cause. Thank you.

amdjfk
Contributor

This error seems common as I see lots of people having the same issue. I wonder if anyone at VMware knows how to troubleshoot it?

 

Yasen_Simeonov
VMware Employee

The two most common reasons are:

1. Trust is not enabled in the Compute Manager for this vCenter in NSX.

2. Time between vCenter and NSX is not in sync.
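A rough way to quantify the second point, as a sketch only: the helper name clock_skew is hypothetical, and it assumes you can obtain the current Unix time from each appliance yourself (for example with `date -u +%s` in a root shell on each, if you have one). It only does the arithmetic and does not say how much drift is acceptable.

```shell
# Hypothetical helper: absolute difference, in seconds, between two Unix timestamps.
# Both arguments are assumed to be whole seconds since the epoch.
clock_skew() {
    if [ "$1" -ge "$2" ]; then
        echo $(( $1 - $2 ))
    else
        echo $(( $2 - $1 ))
    fi
}
```

Capture the two values as close together in time as you can, since any delay between the two readings shows up as apparent skew.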

 

RaymundoEC
VMware Employee

Can you please get the NCP log:

kubectl -n vmware-system-nsx logs <ncp-pod-name> -p

When you enabled WCP, did you enter "corp.local" as the master DNS?

+vRay

doskiran
Contributor

This kind of error usually occurs when the master and worker DNS are configured with the same server.
The master DNS must be reachable from the management network, and the worker DNS must be reachable from the workload network.
If both use the same DNS server, it needs to be reachable from both networks (Management and Workload).
To cross-check network reachability:
- Connect to the Kubernetes API master VM
- Run the commands below:
1) ping -I eth0 <masterDNS>
2) ping -I eth1 <workerDNS>

amdjfk
Contributor

OK, this may be an issue. I am not well versed in the networking going on here. I am not sure how to assign IP addresses to the Ingress and Egress CIDRs. I assume by "worker" you mean these. I understand these need to be routable, but I can't figure out what VLAN they are on. I also don't have the capability to do BGP, and am not sure how to enter a route to these addresses. I can't even figure out what the interfaces to the T0 and T1 routers are. I understand networking, just not NSX-T.

 

nblr06
Enthusiast

@doskiran 

Hi,

Do you know of any other way to log in to the Supervisor VM?

I had the same issue, "Error configurating cluster NIC on master VM", so the "workload management" -> "namespaces" web page hangs at "workload management is still being configured. Please check back later".

I believe this hang is preventing me from downloading and installing the k8s CLI tool to connect to the control plane VMs.

 

By the way,

Do the DNS records need to be created for the master & worker before the deployment of workload management cluster?

thanks

doskiran
Contributor

To log in to the Supervisor master VM:

- SSH into the vCenter and enable the shell (if required)

- Run "/usr/lib/vmware-wcp/decryptK8Pwd.py" to get the IP address and password for the Supervisor Cluster (SC) master VM.

Example:

# /usr/lib/vmware-wcp/decryptK8Pwd.py
Read key from file
Connected to PSQL
Cluster: domain-c8:2bcXXXX
IP: 10.xx.xx.xx
PWD: xxxxxxxxxxx

# ssh root@10.xx.xx.xx

Type "yes" to accept the host key, then enter the PWD value shown above.

 

Once connected to the Supervisor master VM, run the earlier "ping" commands to check master/worker DNS connectivity, then check node status with "kubectl get nodes" and system pod status with "kubectl get pods -A" for troubleshooting.
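With many system pods, the full listing can be long. A minimal sketch for narrowing it down, assuming the default column layout of "kubectl get pods -A" (NAMESPACE, NAME, READY, STATUS, ...); the helper name unhealthy_pods is hypothetical and not a kubectl feature.

```shell
# Hypothetical helper: list pods whose STATUS is neither Running nor Completed.
# Reads the default `kubectl get pods -A` table on stdin and skips the header row.
# Prints "<namespace>/<pod>: <status>" for each remaining pod.
unhealthy_pods() {
    awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2 ": " $4 }'
}
```

On the master VM this would be run as `kubectl get pods -A | unhealthy_pods`. A pod can report Running while some of its containers are not ready, so the READY column in the full listing is still worth a look.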

>> Do the DNS records need to be created for the master & worker before the deployment of workload management cluster?

It depends entirely on your network, but for the master, use the management DNS directly.

amdjfk
Contributor

For those who are interested, I had to get BGP working on the ToR switch to get Workload Management to install. Maybe you can get by without it, but it didn't work for me. Just Sayin'

 

nblr06
Enthusiast

@doskiran That's useful!

I discovered that a pod "tmc-agent-installer-1611810900-8n776" is in Error status and another pod "vsphere-csi-controller-6687dc774f-xnbfq" is in CrashLoopBackOff status on the master.

I didn't have DNS records created for master/worker yet so the ping was unsuccessful.

The three masters are all in "Ready" status (per "kubectl get nodes"), so I can only assume that the hang I mentioned before was due to some other, unknown reason...

Thanks!

unclebright
Contributor

Hi, 

For those who are using BGP to get the Tanzu deployment working, here is the right tutorial:

https://vxplanet.com/2020/05/02/nsx-t-3-0-edge-cluster-automated-deployment-and-architecture-in-vcf-...

 

 
