mashio
Contributor

Enable workload management hangs on configuring

I'm trying to configure a TKG cluster on vSphere 7 for the first time.

NSX-T 3.0 is configured and running.

When I enable Workload Management with all the required info, it never finishes configuring.

In the WCP log I can see many messages repeating in a loop.

Attaching the error messages that repeatedly show up in the log:

2020-06-29T12:52:51.438Z debug wcp informer.processLoop() lister.List() returned

2020-06-29T12:52:54.612Z error wcp [opID=5ef9ca68-domain-c8] Unexpected object: &Status{ListMeta:ListMeta{SelfLink:,ResourceVersion:,Continue:,RemainingItemCount:nil,},Status:Failure,Message:an error on the server ("unable to decode an event from the watch stream: net/http: request canceled (Client.Timeout exceeded while reading body)") has prevented the request from succeeding,Reason:InternalError,Details:&StatusDetails{Name:,Group:,Kind:,Causes:[]StatusCause{StatusCause{Type:UnexpectedServerResponse,Message:unable to decode an event from the watch stream: net/http: request canceled (Client.Timeout exceeded while reading body),Field:,},StatusCause{Type:ClientWatchDecoding,Message:unable to decode an event from the watch stream: net/http: request canceled (Client.Timeout exceeded while reading body),Field:,},},RetryAfterSeconds:0,UID:,},Code:500,}

2020-06-29T12:52:54.612Z error wcp [opID=5ef9ca68-domain-c8] Error watching NSX CRD resources.

2020-06-29T12:52:54.612Z error wcp [opID=5ef9ca68-domain-c8] Error creating NSX resources. Err: Kubernetes API call failed. Details Error watching NSX CRD resources.

2020-06-29T12:52:54.612Z error wcp [opID=5ef9ca68-domain-c8] Failed to create cluster network interface for MasterNode: VirtualMachine:vm-1008. Err: Kubernetes API call failed. Details Error watching NSX CRD resources.

2020-06-29T12:52:54.612Z error wcp [opID=5ef9ca68-domain-c8] Error configuring API server on cluster domain-c8 An error occurred. This operation will be retried.

2020-06-29T12:52:54.832Z error wcp [opID=5ef9ca68-domain-c8] Unexpected object: &Status{ListMeta:ListMeta{SelfLink:,ResourceVersion:,Continue:,RemainingItemCount:nil,},Status:Failure,Message:an error on the server ("unable to decode an event from the watch stream: net/http: request canceled (Client.Timeout exceeded while reading body)") has prevented the request from succeeding,Reason:InternalError,Details:&StatusDetails{Name:,Group:,Kind:,Causes:[]StatusCause{StatusCause{Type:UnexpectedServerResponse,Message:unable to decode an event from the watch stream: net/http: request canceled (Client.Timeout exceeded while reading body),Field:,},StatusCause{Type:ClientWatchDecoding,Message:unable to decode an event from the watch stream: net/http: request canceled (Client.Timeout exceeded while reading body),Field:,},},RetryAfterSeconds:0,UID:,},Code:500,}

2020-06-29T12:52:54.832Z error wcp [opID=5ef9ca68-domain-c8] Error watching NSX CRD resources.

2020-06-29T12:52:54.832Z error wcp [opID=5ef9ca68-domain-c8] Error creating NSX resources. Err: Kubernetes API call failed. Details Error watching NSX CRD resources.

2020-06-29T12:52:54.832Z error wcp [opID=5ef9ca68-domain-c8] Failed to create cluster network interface for MasterNode: VirtualMachine:vm-1007. Err: Kubernetes API call failed. Details Error watching NSX CRD resources.

2020-06-29T12:52:54.832Z error wcp [opID=5ef9ca68-domain-c8] Error configuring API server on cluster domain-c8 An error occurred. This operation will be retried.

2020-06-29T12:52:54.957Z error wcp [opID=5ef9ca68-domain-c8] Unexpected object: &Status{ListMeta:ListMeta{SelfLink:,ResourceVersion:,Continue:,RemainingItemCount:nil,},Status:Failure,Message:an error on the server ("unable to decode an event from the watch stream: net/http: request canceled (Client.Timeout exceeded while reading body)") has prevented the request from succeeding,Reason:InternalError,Details:&StatusDetails{Name:,Group:,Kind:,Causes:[]StatusCause{StatusCause{Type:UnexpectedServerResponse,Message:unable to decode an event from the watch stream: net/http: request canceled (Client.Timeout exceeded while reading body),Field:,},StatusCause{Type:ClientWatchDecoding,Message:unable to decode an event from the watch stream: net/http: request canceled (Client.Timeout exceeded while reading body),Field:,},},RetryAfterSeconds:0,UID:,},Code:500,}

2020-06-29T12:52:54.957Z error wcp [opID=5ef9ca68-domain-c8] Error watching NSX CRD resources.

2020-06-29T12:52:54.957Z error wcp [opID=5ef9ca68-domain-c8] Error creating NSX resources. Err: Kubernetes API call failed. Details Error watching NSX CRD resources.

2020-06-29T12:52:54.957Z error wcp [opID=5ef9ca68-domain-c8] Failed to create cluster network interface for MasterNode: VirtualMachine:vm-1006. Err: Kubernetes API call failed. Details Error watching NSX CRD resources.

2020-06-29T12:52:54.957Z error wcp [opID=5ef9ca68-domain-c8] Error configuring API server on cluster domain-c8 An error occurred. This operation will be retried.

2020-06-29T12:52:54.957Z info wcp [opID=5ef9ca68-domain-c8] no single master succeeded - retrying

2020-06-29T12:52:54.957Z debug wcp Publish change event: &cdc.ChangeLogChangeEvent{Resource:std.DynamicID{Type_:"ClusterComputeResource", Id:"domain-c8"}, Kind:"UPDATE", Properties:[]string{"messages"}, ParentResources:[]std.DynamicID(nil)}

Has anyone had a similar issue to this?

20 Replies
daphnissov
Immortal

It looks like you have issues communicating with NSX-T Manager. Please describe the full networking config you supplied to the WCP wizard.

CaptainCrunchNo
Contributor

I am running into this same issue. Were you able to get this resolved?

My NSX Manager, Edge, and Supervisor cluster IPs are all on the same layer 2 segment, so there shouldn't be any connectivity issues.

mashio
Contributor

Check that the local DVS and the underlay switch are configured with MTU 9000.

daphnissov
Immortal

You do not need an MTU of 9000, just anything at 1600 or higher.
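As a quick sanity check, you can validate the effective MTU on the overlay path with vmkping from an ESXi host. The snippet below is just a sketch of the arithmetic: to exercise a 1600-byte MTU end to end with the don't-fragment bit set, the ICMP payload must leave room for the 20-byte IP and 8-byte ICMP headers (the remote TEP address is a placeholder you'd substitute).

```shell
# Sketch: compute the ICMP payload size that exercises a given MTU with the
# don't-fragment bit set, then print the vmkping command to run on an ESXi host.
MTU=1600
PAYLOAD=$((MTU - 20 - 8))   # subtract IP (20) and ICMP (8) header bytes
echo "vmkping ++netstack=vxlan -d -s ${PAYLOAD} <remote-TEP-IP>"
```

If that size passes but larger ones are dropped, the transport path is clamped below what your DVS setting suggests.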

CaptainCrunchNo
Contributor

Checked the DVS, and all MTUs are set to 9000.

benfab
Contributor

I'm facing the exact same issue. The VCSA, NSX-T Manager, NSX Edge, and the Supervisor VMs are on the same network and can talk to each other.
Have you solved this issue?


Thank you!

vineethac
Contributor

Hi all

I am having a similar issue while trying to enable workload management. Using VCSA 7 U1, ESXi 7 U1 and NSX-T 3.0.1.1.


Three supervisor control plane VMs are deployed and powered on. I can also see a new T1 gateway, some new segments, NAT rules, and LBs in NSX-T Manager.

tail -f /var/log/vmware/wcp/wcpsvc.log shows the following:

2020-10-13T19:08:04.3Z debug wcp [opID=5f862f9f] Getting HOK signer; store: wcp, alias: wcp

2020-10-13T19:08:04.5Z debug wcp [opID=5f862fa0] Getting HOK signer; store: wcp, alias: wcp

2020-10-13T19:08:04.699Z debug wcp [opID=5f862fa1] Getting HOK signer; store: wcp, alias: wcp

2020-10-13T19:08:04.899Z debug wcp [opID=5f862fa2] Getting HOK signer; store: wcp, alias: wcp

2020-10-13T19:08:04.952Z debug wcp [opID=5f86289f-domain-c8] Cluster Network Provider is NSXT Container Plugin. Performing additional NCP-specific configuration.

2020-10-13T19:08:05.019Z debug wcp [opID=5f862fa3] Getting HOK signer; store: wcp, alias: wcp

2020-10-13T19:08:05.019Z warning wcp [opID=5f85ea04-ea1f] Reflector for Resource:virtualmachineclasses, ClusterID:domain-c8 failed. Err: server/kubelifecycle/reflector/reflector.go:118: Failed to list <unspecified>: Failed to list virtualmachineclasses: Unauthorized. Will retry.

2020-10-13T19:08:05.092Z debug wcp [opID=5f862fa3] Getting HOK signer; store: wcp, alias: wcp

2020-10-13T19:08:05.099Z debug wcp [opID=5f86289f-domain-c8] Cluster Network Provider is NSXT Container Plugin. Performing additional NCP-specific configuration.

2020-10-13T19:08:05.161Z debug wcp [opID=5f862fa3] vcrestlib: requesting new session

2020-10-13T19:08:05.264Z debug wcp [opID=5f86289f-domain-c8] Cluster Network Provider is NSXT Container Plugin. Performing additional NCP-specific configuration.

2020-10-13T19:08:05.375Z debug wcp [opID=5f862fa4] Getting HOK signer; store: wcp, alias: wcp

2020-10-13T19:08:05.375Z error wcp [opID=5f86289f-domain-c8] Error checking if NSX resources exist. Err: Unauthorized

2020-10-13T19:08:05.376Z error wcp [opID=5f86289f-domain-c8] Error checking if NSX resources exist for VMs: [vm-2050]. Err: Unauthorized

2020-10-13T19:08:05.376Z error wcp [opID=5f86289f-domain-c8] Error creating NSX resources. Err: Unauthorized

2020-10-13T19:08:05.376Z error wcp [opID=5f86289f-domain-c8] Failed to create cluster network interface for MasterNode: VirtualMachine:vm-2050. Err: Unauthorized

2020-10-13T19:08:05.376Z error wcp [opID=5f86289f-domain-c8] Error configuring cluster NIC on master VM vm-2050: Unauthorized

2020-10-13T19:08:05.376Z error wcp [opID=5f86289f-domain-c8] Error configuring API server on cluster domain-c8 Error configuring cluster NIC on master VM. This operation is part of API server configuration and will be retried.

2020-10-13T19:08:05.451Z debug wcp [opID=5f862fa4] Getting HOK signer; store: wcp, alias: wcp

2020-10-13T19:08:05.519Z debug wcp [opID=5f862fa4] vcrestlib: requesting new session

2020-10-13T19:08:05.714Z debug wcp [opID=5f862fa6] Getting HOK signer; store: wcp, alias: wcp

2020-10-13T19:08:05.715Z error wcp [opID=5f86289f-domain-c8] Error checking if NSX resources exist. Err: Unauthorized

2020-10-13T19:08:05.715Z error wcp [opID=5f86289f-domain-c8] Error checking if NSX resources exist for VMs: [vm-2048]. Err: Unauthorized

2020-10-13T19:08:05.715Z error wcp [opID=5f86289f-domain-c8] Error creating NSX resources. Err: Unauthorized

2020-10-13T19:08:05.715Z error wcp [opID=5f86289f-domain-c8] Failed to create cluster network interface for MasterNode: VirtualMachine:vm-2048. Err: Unauthorized

2020-10-13T19:08:05.715Z error wcp [opID=5f86289f-domain-c8] Error configuring cluster NIC on master VM vm-2048: Unauthorized

2020-10-13T19:08:05.715Z error wcp [opID=5f86289f-domain-c8] Error configuring API server on cluster domain-c8 Error configuring cluster NIC on master VM. This operation is part of API server configuration and will be retried.

2020-10-13T19:08:05.776Z debug wcp [opID=5f862fa6] Getting HOK signer; store: wcp, alias: wcp

2020-10-13T19:08:05.847Z debug wcp [opID=5f862fa6] vcrestlib: requesting new session

2020-10-13T19:08:06.048Z debug wcp [opID=5f862fa7] Getting HOK signer; store: wcp, alias: wcp

2020-10-13T19:08:06.049Z error wcp [opID=5f86289f-domain-c8] Error checking if NSX resources exist. Err: Unauthorized

2020-10-13T19:08:06.049Z error wcp [opID=5f86289f-domain-c8] Error checking if NSX resources exist for VMs: [vm-2049]. Err: Unauthorized

2020-10-13T19:08:06.049Z error wcp [opID=5f86289f-domain-c8] Error creating NSX resources. Err: Unauthorized

2020-10-13T19:08:06.049Z error wcp [opID=5f86289f-domain-c8] Failed to create cluster network interface for MasterNode: VirtualMachine:vm-2049. Err: Unauthorized

2020-10-13T19:08:06.049Z error wcp [opID=5f86289f-domain-c8] Error configuring cluster NIC on master VM vm-2049: Unauthorized

2020-10-13T19:08:06.049Z error wcp [opID=5f86289f-domain-c8] Error configuring API server on cluster domain-c8 Error configuring cluster NIC on master VM. This operation is part of API server configuration and will be retried.

2020-10-13T19:08:06.049Z warning wcp [opID=5f86289f-domain-c8] Error configuring cluster NIC. Err <nil>

2020-10-13T19:08:06.049Z warning wcp [opID=5f86289f-domain-c8] Error configuring cluster NIC. Err <nil>

2020-10-13T19:08:06.049Z warning wcp [opID=5f86289f-domain-c8] Error configuring cluster NIC. Err <nil>

2020-10-13T19:08:06.049Z info wcp [opID=5f86289f-domain-c8] no single master succeeded - retrying

2020-10-13T19:08:06.049Z debug wcp Publish change event: &cdc.ChangeLogChangeEvent{Resource:std.DynamicID{Type_:"ClusterComputeResource", Id:"domain-c8"}, Kind:"UPDATE", Properties:[]string{"messages"}, ParentResources:[]std.DynamicID(nil)}

2020-10-13T19:08:06.05Z debug wcp [opID=5f86289f] [ END ] [kubelifecycle.(*Controller).syncClusterState:285] [8.645360187s] cluster=domain-c8

2020-10-13T19:08:06.165Z debug wcp [opID=5f862fa5] Getting HOK signer; store: wcp, alias: wcp

2020-10-13T19:08:06.299Z debug wcp [opID=5f862fa8] Getting HOK signer; store: wcp, alias: wcp

2020-10-13T19:08:06.411Z debug wcp [opID=5f862fa9] Getting HOK signer; store: wcp, alias: wcp

2020-10-13T19:08:06.558Z debug wcp [opID=5f862faa] Getting HOK signer; store: wcp, alias: wcp

2020-10-13T19:08:06.559Z warning wcp [opID=5f85ea04-ea20] Reflector for Resource:limitranges, ClusterID:domain-c8 failed. Err: server/kubelifecycle/reflector/reflector.go:118: Failed to list <unspecified>: Failed to list limitranges: Unauthorized. Will retry.

2020-10-13T19:08:06.67Z debug wcp [opID=5f862fad] Getting HOK signer; store: wcp, alias: wcp

2020-10-13T19:08:06.814Z debug wcp [opID=5f862fac] Getting HOK signer; store: wcp, alias: wcp

2020-10-13T19:08:06.932Z debug wcp [opID=5f862fae] Getting HOK signer; store: wcp, alias: wcp

2020-10-13T19:08:07.042Z debug wcp [opID=5f862fb0] Getting HOK signer; store: wcp, alias: wcp

2020-10-13T19:08:07.043Z warning wcp [opID=5f85ea04-ea23] Reflector for Resource:serviceaccounts, ClusterID:domain-c8 failed. Err: server/kubelifecycle/reflector/reflector.go:118: Failed to list <unspecified>: Failed to list serviceaccounts: Unauthorized. Will retry.

2020-10-13T19:08:07.153Z debug wcp [opID=5f862faf] Getting HOK signer; store: wcp, alias: wcp

2020-10-13T19:08:07.267Z debug wcp [opID=5f862fb1] Getting HOK signer; store: wcp, alias: wcp

Any pointers on resolving this issue?

Thanks

Vineeth

nachogonzalez
Expert

Hey, hope you are doing fine

You are getting an Unauthorized error:

Error checking if NSX resources exist. Err: Unauthorized

2020-10-13T19:08:05.376Z error wcp [opID=5f86289f-domain-c8] Error checking if NSX resources exist for VMs: [vm-2050]. Err: Unauthorized

2020-10-13T19:08:05.376Z error wcp [opID=5f86289f-domain-c8] Error creating NSX resources. Err: Unauthorized

Can you check that permissions are correct on NSX, Kubernetes, and the vSphere compute manager?
Have you accepted all certificates?
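One way to check the NSX side independently of WCP is to call the NSX Manager API directly with the credentials you expect WCP to be using and look at the HTTP status. The block below only builds and prints the probe command (the manager FQDN is a placeholder); a 401 from the real call would match the Unauthorized errors in the log, while a 200 points elsewhere.

```shell
# Sketch: build a curl probe against the NSX Manager API (address is a
# placeholder). Run the printed command yourself: 200 = auth OK,
# 401 = bad credentials/role, matching the "Err: Unauthorized" lines above.
NSX_MGR="nsx-manager.lab.local"
PROBE="curl -sk -u admin https://${NSX_MGR}/api/v1/cluster/status -o /dev/null -w '%{http_code}\n'"
echo "$PROBE"
```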

Warm regards

vineethac
Contributor

Thanks for the quick response. After finishing the Workload Management configuration wizard, it was able to deploy three supervisor control plane VMs, and it also configured a new T1 gateway, some new segments, NAT rules, and LBs in NSX-T Manager. So I can't really understand what this Unauthorized means!

pierrevm123
Contributor

Does anyone have a clue or a hint to troubleshoot further? I seem to have exactly the same problem.

I have upgraded to the latest vCenter. Running all ESXi hosts on VMware Workstation in my home lab.

vineethac
Contributor

This issue is now resolved. It was due to a missing configuration in NSX-T. Thanks Hari (@hari5611) for identifying it and helping me fix it.

On the Tier-0 Gateway I added route re-distribution to allow all overlay traffic. This was missing earlier, and adding it fixed the problem.


Thanks

Vineeth

amdjfk
Contributor

So what does this mean? Is this step in the docs anywhere? How did you configure it? 

Yasen_Simeonov
VMware Employee

The two most common reasons are:

1. Trust is not enabled in the Compute Manager for this vCenter in NSX.

2. Time between vCenter and NSX is not in sync.
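For reason 2, a minimal sketch of the check: collect `date +%s` from both the vCenter and NSX Manager appliances (e.g. over SSH) and compare the values. The epoch numbers below are made-up examples; even a modest skew can make token-based authentication between the two fail with Unauthorized.

```shell
# Sketch: compare epoch seconds collected from vCenter and NSX Manager
# (example values; substitute the output of `date +%s` from each appliance).
VC_EPOCH=1602615600
NSX_EPOCH=1602615612
if [ "$VC_EPOCH" -ge "$NSX_EPOCH" ]; then
  SKEW=$((VC_EPOCH - NSX_EPOCH))
else
  SKEW=$((NSX_EPOCH - VC_EPOCH))
fi
echo "clock skew: ${SKEW}s"   # more than a few seconds: fix NTP on both
```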

ausvmguy
VMware Employee

I get exactly the same thing regardless of whether I configure using NSX-T or vSphere Networking (i.e., no NSX-T).

3 nodes up. Only 1 has any appreciable CPU activity (averaging about 50%). The other nodes average 1%.

All nodes have only a single NIC (thus it isn't getting to the point of adding and configuring the 2nd NIC).

I can SSH to the "master" node. "kubectl get nodes" lists a single node; its status is "Ready", its role is "master", and the version is v1.18.2.

"kubectl get cluster --all-namespaces" responds with "No resources found".

Checking the wcpsvc.log file on vCenter, I can't see any errors that would explain what is happening (or not happening). Then again, I may be looking for the wrong text in the log.

Any suggestions on where to look?

Thanks.

 

nblr06
Enthusiast

Hey, I had the same "error configuring cluster NIC on master VM" issue too.

Does anyone know whether the "ingress" and "egress" CIDRs should be the same as the subnet of the edge uplink? Or can the ingress and egress CIDRs be any routable VLAN subnet?

I suspect this issue could be caused by the NSX-T network settings, but I'm not sure whether any additional routing or SNAT should be considered.

I've tried deploying the Workload Management cluster so many times; it's frustrating...

(Note that my environment is using static routing for NSX-T edge, not BGP.)

amdjfk
Contributor

I got my system further by giving in and using BGP. You just can't run this stuff without it.

Now I get TLS errors. When I try to connect via SSL, the TLS handshake hangs while establishing the connection. But if I SSH from the Supervisor VM to the ingress IP, it works. I have no idea what to do from there.

 

ausvmguy
VMware Employee

I ended up getting mine working.

It was an issue with the underlying networking: there was ping connectivity to the management K8s cluster IP, but the logs indicated that vCenter was unable to connect to this IP.
Did some testing: ping only worked with packet sizes less than 1400 bytes (even though the MTU was set to 9000).
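That ping observation translates directly into an effective path MTU. A sketch of the arithmetic (1372 here is a hypothetical largest payload that still passed with the don't-fragment bit set, consistent with the ~1400-byte ceiling above):

```shell
# Sketch: derive the effective path MTU from the largest ICMP payload that
# passed a don't-fragment ping (1372 is an example probe result).
MAX_PAYLOAD=1372
PATH_MTU=$((MAX_PAYLOAD + 20 + 8))   # add back IP (20) and ICMP (8) headers
echo "effective path MTU: ${PATH_MTU}"   # 1400 here, despite 9000 on the switch
```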

The cause was traced to an issue with two of my hosts (this is a home lab environment). Two of the hosts are Gen 6 NUCs with a USB NIC as the second adapter, using the VMware fling driver (the other two hosts are Gen 8 NUCs with two onboard NICs).
Even though the MTU at the switch layer was set to 9000, the driver for USB NICs only supports up to 4000.

So, I reconfigured the vDS that has the NSX-T transport VLAN to an MTU of 4000 bytes.

I then redeployed the Workload Management cluster, and it completed successfully (using NSX-T networking).

FYI: I'm not using BGP as a routing protocol (only static routing between the NSX-T T0 router and the external NBN router).

Hope this helps someone.

amdjfk
Contributor

This did the trick. I had all MTUs set to 9000 because in the networking world, that works: if some other component is limited to less, packets at that lower limit can still go through hassle-free. But apparently in the VMware networking world, setting your MTU too high causes problems. Thanks, VMware.

I set the NSX-T overlay and the edge overlay VLANs to an MTU of 1600. Now things work, and I can connect to the K8s interface.

nblr06
Enthusiast

My ESXi infrastructure had the physical switch and VDS MTU set to 9000, same as the NSX-T transport node profiles. But the hanging issue is still there... 😫

However, my DNS server is located on another physical switch (VLAN) with MTU 1700, separated by a physical firewall.

 

I'm wondering whether the underlying network MTU between all the services (such as DNS and vCenter, even if they are on different VLANs) should be the same or not.

 
