Enthusiast

Control Plane Agent to Controller Down

Hi all,

I have an interesting issue with a fresh NSX install.

Environment details:

  • vCenter 6.5
  • ESXi 6.5
  • NSX 6.4

Two clusters

  • Management and Edge (4 hosts)
  • Compute (4 hosts)

I have installed NSX Manager and deployed the controller nodes, and initially everything looked good. However, the dashboard reports 4 hosts with communication issues:

pastedImage_2.png

All 4 hosts are in the Management cluster, which, according to NSX Manager, has been successfully prepared and configured. Digging a bit deeper, the hosts are reporting SSL handshake failures:

pastedImage_3.png

The netcpa logs confirm this as well:

2018-03-14T07:15:06.782Z error netcpa[C86E817700] [Originator@6876 sub=Default] SSL handshake failed on x.x.x.x:0 : error = SSL Exception: error:140000DB:SSL routines:SSL routines:short read

Entries exist for every controller.
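To confirm which controllers the failures point at, the netcpa log can be summarised per peer address. The following is only a rough sketch: the regular expression is modelled on the single log line quoted above, so other netcpa versions may need a different pattern, and the sample addresses are placeholders.

```python
import re
from collections import Counter

# Pattern modelled on the one netcpa line quoted in this post (assumption:
# other builds may format the message differently).
HANDSHAKE_RE = re.compile(
    r"^(?P<ts>\S+)\s+error\s+netcpa\[[^\]]+\].*?"
    r"SSL handshake failed on (?P<peer>\S+):(?P<port>\d+)\s*:\s*"
    r"error = (?P<reason>.*)$"
)


def count_handshake_failures(lines):
    """Return a Counter mapping peer address -> number of handshake failures."""
    failures = Counter()
    for line in lines:
        match = HANDSHAKE_RE.match(line.strip())
        if match:
            failures[match.group("peer")] += 1
    return failures


if __name__ == "__main__":
    # Placeholder log lines; in practice, read them from a copy of the log file.
    sample = [
        "2018-03-14T07:15:06.782Z error netcpa[C86E817700] "
        "[Originator@6876 sub=Default] SSL handshake failed on 192.0.2.11:0 : "
        "error = SSL Exception: error:140000DB:SSL routines:SSL routines:short read",
    ]
    for peer, count in count_handshake_failures(sample).items():
        print(peer, count)
```

If every controller shows a similar count, the problem is more likely on the host or certificate side than on one particular controller.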

There is IP connectivity between the hosts and the controllers, and port 1234 is not being blocked.
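For anyone wanting to repeat the port check, a plain TCP connect test along these lines is one option. It is a minimal sketch: the controller addresses are placeholders, and it would need to run from a machine on the same management network as the hosts. A successful connect only shows that the port is reachable; it says nothing about whether the SSL handshake will succeed.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


if __name__ == "__main__":
    # Placeholder controller IPs (assumption); replace with the real ones.
    controllers = ["192.0.2.11", "192.0.2.12", "192.0.2.13"]
    for ip in controllers:
        state = "open" if tcp_port_open(ip, 1234) else "unreachable"
        print(f"{ip}:1234 {state}")
```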

VTEP and management traffic use separate VLANs, and these are consistent across both clusters.

There is a VMware KB article that states updating the controller state is a short-term fix, but that doesn't work in my environment.

Any ideas?

Thanks,

3 Replies
Expert

If forward and reverse DNS entries are OK and time is synchronized, the problem may be related to the following KB (it describes a controller upgrade, but the errors seen on a fresh installation are similar):

https://kb.vmware.com/s/article/2151089

If your logs look like the ones in the KB, Update Controller State may force the certificate to be renewed:

  Navigate to Network & Security > Installation > Management > NSX Manager > Actions > Update Controller State to pick up the new certificate.
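The DNS precondition can be checked quickly with a forward and reverse lookup round trip. This is a rough sketch using Python's standard resolver calls; the host name in the example is a placeholder, and the lookups use whatever resolver the machine running the script is configured with, which may differ from the one NSX Manager uses.

```python
import socket


def check_dns(hostname, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve hostname -> IP -> name and report whether the two names agree.

    The resolver functions are parameters only so they can be swapped out
    (for example in a test); by default the system resolver is used.
    """
    ip = forward(hostname)
    reverse_name = reverse(ip)[0]  # gethostbyaddr returns (name, aliases, addresses)
    matches = reverse_name.rstrip(".").lower() == hostname.rstrip(".").lower()
    return ip, reverse_name, matches


if __name__ == "__main__":
    # Placeholder FQDN (assumption); use the ESXi host or controller name.
    name = "esx01.example.local"
    try:
        print(check_dns(name))
    except OSError as exc:
        print(f"lookup failed for {name}: {exc}")
```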

Does vsm.log contain entries similar to this?

2017-06-06 17:10:50.785 GMT+00:00 ERROR NVPStatusCheck NvpRestClientManagerImpl:794 - nvp controller node (172.16.0.10) return error org.springframework.web.client.ResourceAccessException: I/O error on GET request for "https://172.16.0.10:443/ws.v1/control-cluster/node?fields=cluster_mgmt_listen_addr,uuid,tags": Read timed out; nested exception is java.net.SocketTimeoutException: Read timed out

And the controller logs:

  • 2017-06-06 18:32:50,347 19123181348 [listener] INFO com.vmware.controller.server.Listener - Accept Connection [ip=172.24.2.26:46115, cnnId=21264] from /172.24.2.26:46115
    2017-06-06 18:32:50,357 19123181358 [reader 3] ERROR com.vmware.controller.server.ssl.SelfSignedX509TrustManager - Unknow chassis certificate: [
    [
    Version: V3
    Subject: CN="VMWare VXLAN Host Certificate host-11573 OU=Nectworking O=VMWare ST=CA C=US"
    Signature Algorithm: SHA256withRSA, OID = 1.2.840.113549.1.1.11
    Key: Sun RSA public key, 2048 bits

    modulus: 22911650522799465929163707326918080254704523027188317203645647153931638466371122064197258058841116911989320009855294745617721779386019557021249605122136935010401
    36836560115024772432023329796195620130983113379731661924922830333592692791543147876405959524921451570805385813377696469386291738246946920048747704248124484079384552745316
    66112531666589757995492441394796111464829401754007815754348273682553447185738440211794264079252464938057216938803523707224061663150480722911564461043934851115967587589348
    39992978266706878205075684179188691037974878624050280597452927405166323249390673946856460750742686036206044340415301
    public exponent: 65537

    Validity: [From: Fri Apr 28 10:14:16 UTC 2017,
    To: Tue Sep 13 10:14:16 UTC 2044]
    Issuer: CN="VMWare VXLAN Host Certificate host-11573 OU=Nectworking O=VMWare ST=CA C=US"
    SerialNumber: [ 015bb40d d45c]

    >2017-06-07T14:28:04.785693+00:00 2017-06-07 14: 28:04,785 19194224947 [reader 1] ERROR com.vmware.controller.server.ssl.SelfSignedX509TrustManager - Unknow chassis certificate: [#012[#012 Version: V3#012 Subject: CN="VMWare VXLAN Host Certificate host-11573 OU=Nectworking O=VMWare ST=CA C=US"#012 Signature Algorithm: SHA256withRSA, OID = 1.2.840.113549.1.1.11#012#012
    Key: Sun RSA public key, 2048 bits#012 modulus: 229116505227994659291637073269180802547045230271883172036456471539316384663711220641972580588411169
    119893200098552947456177217793860195570212496051221369350104013683656011502477243202332979619562013098311337973166192492283033359269279154314787640595..

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
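The second controller log entry above is harder to read because its line breaks appear as #012, which looks like the syslog-style octal escape for a newline (octal 012 is ASCII 10). Assuming that is what happened here, a small helper like this sketch can restore the original layout before comparing the entry against your own logs.

```python
import re


def unescape_syslog(text):
    """Replace #ooo octal escapes (e.g. #012 for newline) with the real character.

    Caveat: a literal '#' followed by three octal digits in the original
    message would also be converted, so treat the output as a reading aid.
    """
    return re.sub(r"#([0-7]{3})", lambda m: chr(int(m.group(1), 8)), text)


if __name__ == "__main__":
    escaped = "Unknow chassis certificate: [#012[#012 Version: V3#012 Subject: ..."
    print(unescape_syslog(escaped))
```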


Cause


This issue occurs when the Controller fails to authenticate the certificate of the host causing the handshake to fail.


Resolution


This issue is resolved in VMware NSX for vSphere 6.3.5, available at VMware Downloads.

To work around this issue if you do not want to upgrade, navigate to Network & Security > Installation > Management > NSX Manager > Actions > Update Controller State to pick up the new certificate.

The following error codes are supported:

1255602: Incomplete Controller Certificate
1255603: SSL Handshake Failure
1255604: Connection Refused
1255605: Keep-alive Timeout
1255606: SSL Exception
1255607: Bad Message
1255620: Unknown Error
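If you are collecting these codes from several hosts, a simple lookup table makes the output easier to read. The mapping below just restates the codes listed above; the function name is made up for illustration and is not part of any NSX tooling.

```python
# Codes and descriptions copied from the list above.
CONNECTION_ERROR_CODES = {
    1255602: "Incomplete Controller Certificate",
    1255603: "SSL Handshake Failure",
    1255604: "Connection Refused",
    1255605: "Keep-alive Timeout",
    1255606: "SSL Exception",
    1255607: "Bad Message",
    1255620: "Unknown Error",
}


def describe_error(code):
    """Return the description for a connection error code (hypothetical helper)."""
    return CONNECTION_ERROR_CODES.get(code, f"Unrecognized code {code}")


if __name__ == "__main__":
    print(describe_error(1255603))
```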
Expert

I think I missed your last sentence about the KB. If it is the same KB, then the certificate handshake issue could be related to the trust store. Are the certificates in use self-signed?

Secure Configuration of NSX 2017

NSX-v 6.3.x - Security Configuration Guide (Published version 2.1)

The NSX Manager uses a Java Keystore to store the certificates it has provisioned. Other NSX components, such as the NSX controllers, leverage encrypted and password-protected PEM files to store their certificates.

Enthusiast

As previously stated - "There is a VMware KB article that states updating the controller state is a short term fix, but that doesn't work in my environment." ;)

However, I decided to completely redeploy my controller cluster, and there are no more SSL handshake issues for now.
