I have an interesting issue with a fresh NSX install.
I have installed NSX Manager and deployed the controller nodes, and initially everything looked good. However, the dashboard reports 4 hosts with communication issues:
These 4 servers all belong to the Management cluster and, according to NSX Manager, have been successfully prepped and configured. Digging a bit deeper, it's reporting SSL handshake failures:
netcpa logs confirm this as well : 2018-03-14T07:15:06.782Z error netcpa[C86E817700] [Originator@6876 sub=Default] SSL handshake failed on x.x.x.x:0 : error = SSL Exception: error:140000DB:SSL routines:SSL routines:short read
Entries exist for every controller.
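To see at a glance which controllers the failures are logged against, the netcpa lines can be tallied per peer. This is only a rough sketch: the regular expression is modelled on the single log line quoted above, and the function names are made up for illustration, so other netcpa versions may need a different pattern.

```python
import re
from collections import Counter

# Pattern modelled on the one netcpa line quoted in this thread (an assumption,
# not an official log format definition).
HANDSHAKE_RE = re.compile(
    r"^(?P<timestamp>\S+)\s+error\s+netcpa\[[^\]]*\].*?"
    r"SSL handshake failed on (?P<peer>\S+)\s*:\s*error = (?P<error>.+)$"
)


def parse_handshake_failure(line):
    """Return timestamp, peer and error for a handshake failure line, else None."""
    match = HANDSHAKE_RE.match(line.strip())
    return match.groupdict() if match else None


def count_failures_by_peer(lines):
    """Count SSL handshake failures per peer address across an iterable of lines."""
    counts = Counter()
    for line in lines:
        parsed = parse_handshake_failure(line)
        if parsed:
            counts[parsed["peer"]] += 1
    return counts
```

Feeding it the lines of the host's netcpa log should show whether every controller is affected or only some of them.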
There is IP connectivity between host and controller, and Port 1234 is not being blocked.
VTEP and Management are facilitated by separate VLANs, but these are consistent across both clusters.
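For anyone wanting to re-check the connectivity claim from a machine on the management network, a plain TCP connect to port 1234 can be scripted as below. This is a minimal sketch: the function name is hypothetical, and a successful connect only shows the port is reachable; it says nothing about whether the SSL handshake itself will succeed.

```python
import socket


def tcp_port_open(host, port=1234, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False
```

Calling `tcp_port_open("<controller-ip>")` once per controller gives a quick yes/no for each node.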
There is a VMware KB article that states updating the controller state is a short term fix, but that doesn't work in my environment.
If DNS forward and reverse entries are OK and time is synchronized, the problem may be related to the following KB. (It mentions an upgrade of the controllers, but the errors on a fresh installation are similar):
If you find logs similar to those in the KB, Update Controller State may force the certificate to be renewed:
Navigate to Network & Security > Installation > Management > NSX Manager > Actions > Update Controller State to pick up the new certificate.
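The DNS precondition mentioned above (forward and reverse entries agreeing) can be checked with a short script. This is a sketch under assumptions: the function name is hypothetical, and the resolver functions are passed in as parameters only so the logic can be exercised without a live DNS server.

```python
import socket


def dns_round_trip(hostname, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve hostname -> IP -> name and report whether the names agree.

    Returns (ip, reverse_name, matches).
    """
    ip = forward(hostname)
    reverse_name, aliases, _addresses = reverse(ip)

    def normalise(name):
        # Compare case-insensitively and ignore a trailing root dot.
        return name.lower().rstrip(".")

    known_names = {normalise(reverse_name)} | {normalise(a) for a in aliases}
    return ip, reverse_name, normalise(hostname) in known_names
```

Running it against the FQDN of each host, controller and the NSX Manager should return `True` in the third field when forward and reverse records are consistent.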
Does vsm.log contain entries similar to the following?
2017-06-06 17:10:50.785 GMT+00:00 ERROR NVPStatusCheck NvpRestClientManagerImpl:794 - nvp controller node (172.16.0.10) return error org.springframework.web.client.ResourceAccessException: I/O error on GET request for "https://172.16.0.10:443/ws.v1/control-cluster/node?fields=cluster_mgmt_listen_addr,uuid,tags": Read timed out; nested exception is java.net.SocketTimeoutException: Read timed out
And the controller logs:
Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
The following error codes are supported:
1255602: Incomplete Controller Certificate
1255603: SSL Handshake Failure
1255604: Connection Refused
1255605: Keep-alive Timeout
1255606: SSL Exception
1255607: Bad Message
1255620: Unknown Error
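If these codes turn up in scripts or alerts, the list above can be kept as a small lookup table. The table simply mirrors the codes quoted here; the helper name is hypothetical.

```python
# Error codes exactly as listed in this thread.
CONTROL_PLANE_ERRORS = {
    1255602: "Incomplete Controller Certificate",
    1255603: "SSL Handshake Failure",
    1255604: "Connection Refused",
    1255605: "Keep-alive Timeout",
    1255606: "SSL Exception",
    1255607: "Bad Message",
    1255620: "Unknown Error",
}


def describe_error(code):
    """Return the description for a listed code, or None if it is not in the table."""
    return CONTROL_PLANE_ERRORS.get(int(code))
```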
I think I missed your last sentence about the KB. If this is the same KB, then the certificate handshake issue could be related to the trust store. Are the certificates in use self-signed?
The NSX Manager uses a Java Keystore to store the certificates it has provisioned. Other NSX components, such as the NSX controllers, leverage encrypted and password-protected PEM files to store their certificates.
As previously stated - "There is a VMware KB article that states updating the controller state is a short term fix, but that doesn't work in my environment."
However, I decided to completely redeploy my controller cluster, and there are no more SSL handshake issues for now...