VMware Cloud Community
stevewalker74
Enthusiast

Should the /psc URL work on both HA load balanced PSC nodes?

I have run into a strange issue that occurs after enabling two PSC 6.5 nodes in an HA configuration as part of a rolling upgrade from 5.5.

The first PSC node in a new site was migrated from the original Windows vCenter 5.5 SSO to PSC 6.5, and a second new node was then joined to the first site so that replication could be established. I'm using a Citrix NetScaler to load balance the configuration, and I noticed at some point after the successful HA repointing that I am unable to access the https://hosso01.sbcpureconsult.internal/psc URL. The second node, https://hosso02.sbcpureconsult.internal/psc, works correctly and redirects to the load balanced address psc-ha-vip.sbcpureconsult.internal for authentication before displaying the PSC client UI. Irrespective of which node is selected, I am able to log in to vCenter, then choose Administration > System Configuration, select a node and then Manage, Settings or CA without receiving any errors.

If I deliberately drop the first node out of the load balancing config on the NetScaler I don't have any issues accessing the /psc URL by either host name or the load balancer name, but if I try to connect to the first node directly by its own DNS name or IP I get an HTTP 400 error and the following entry in:

/storage/log/vmware/psc-client/psc-client.log

[2018-10-08 12:05:20.347] [ERROR] tomcat-http--3 com.vmware.vsphere.client.security.websso.MetadataGeneratorImpl - Error when creating idp metadata.

java.lang.RuntimeException: java.io.IOException: HTTPS hostname wrong:  should be <psc-ha-vip.sbcpureconsult.internal>

It appears that the HTTP 400 error occurs because the psc-client Tomcat application no longer starts up correctly on the first node, along with an error in:

/storage/log/vmware/rhttpproxy/rhttpproxy.log

2018-10-08T13:27:10.691Z warning rhttpproxy[7FEA4B941700] [Originator@6876 sub=Default] SSL Handshake failed for stream <SSL(<io_obj p:0x00007fea2c098010, h:27, <TCP '192.168.0.117:443'>, <TCP '192.168.0.121:26417'>>)>: N7Vmacore3Ssl12SSLExceptionE(SSL Exception: error:140000DB:SSL routines:SSL routines:short read)
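In case it's useful for comparison, a quick generic check (nothing PSC specific, just openssl, which is present on the appliance) to see which certificate each node actually presents on port 443 is something like:

openssl s_client -connect hosso01.sbcpureconsult.internal:443 </dev/null 2>/dev/null | openssl x509 -noout -subject -issuer

openssl s_client -connect hosso02.sbcpureconsult.internal:443 </dev/null 2>/dev/null | openssl x509 -noout -subject -issuer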

I've repeated in my lab environment the same steps carried out at the customer site and can confirm the same behaviour. I should add, however, that all other vCenter functionality works correctly and this issue only affects the /psc URL.

Could this be deemed 'correct' behaviour? If I browse to https://psc-ha-vip.sbcpureconsult.internal/psc (the load balancer address) I am initially only able to connect if the second node is online and happens to be the one selected.

I am happy to provide detailed steps of the upgrade process, but first I would like to confirm whether it should be possible to access the /psc URL on each node directly.

7 Replies
Vijay2027
Expert

To configure the 6.5 PSCs in HA mode, did you replace the machine SSL certificates on both PSCs?

Ref: VMware Knowledge Base

stevewalker74
Enthusiast

Yes, the certificate on both nodes now contains subject alternative names as follows:

/usr/lib/vmware-vmafd/bin/vecs-cli entry list --store MACHINE_SSL_CERT --text

X509v3 Subject Alternative Name:

  DNS:hosso01.sbcpureconsult.internal, DNS:hosso02.sbcpureconsult.internal, DNS:psc-ha-vip.sbcpureconsult.internal

Vijay2027
Expert

I hope you used the same certs on both PSCs.

Run the commands below on both PSCs and verify that the endpoints point to the LB VIP.

# /usr/lib/vmidentity/tools/scripts/lstool.py list --url https://localhost:7080/lookupservice/sdk --site sitename --type cs.license | grep "URL:"

# /usr/lib/vmidentity/tools/scripts/lstool.py list --url https://PSC_FQDN/lookupservice/sdk --site sitename --type cs.identity | grep "URL:"

Get the site name by running the command below:

/usr/lib/vmware-vmafd/bin/vmafd-cli get-site-name --server-name localhost

stevewalker74
Enthusiast

Yes, the certificates were generated on the first node and copied to the second.

The first command you provided (shown on the next line) needed to be modified to use plain http instead of https, otherwise I would get 'com.vmware.vim.vmomi.client.exception.SslException: javax.net.ssl.SSLException: Server certificate chain not verified (no details)':

/usr/lib/vmidentity/tools/scripts/lstool.py list --url http://localhost:7080/lookupservice/sdk --site Default-First-Site --type cs.license | grep "URL:"

  URL: https://psc-ha-vip.sbcpureconsult.internal:443/ls/sdk

  URL: https://psc-ha-vip.sbcpureconsult.internal:443/ls/ph/sdk

  URL: https://psc-ha-vip.sbcpureconsult.internal:443/ls/healthstatus

  URL: https://psc-ha-vip.sbcpureconsult.internal:443/ls/resourcebundle

  URL: https://psc-ha-vip.sbcpureconsult.internal:443/ls/sdk

  URL: https://psc-ha-vip.sbcpureconsult.internal:443/ls/healthstatus

  URL: https://psc-ha-vip.sbcpureconsult.internal:443/ls/ph/sdk

  URL: https://psc-ha-vip.sbcpureconsult.internal:443/ls/resourcebundle

/usr/lib/vmidentity/tools/scripts/lstool.py list --url https://psc-ha-vip.sbcpureconsult.internal/lookupservice/sdk --site Default-First-Site --type cs.identity | grep "URL:"

  URL: https://psc-ha-vip.sbcpureconsult.internal/sts/STSService/vsphere.local

  URL: https://psc-ha-vip.sbcpureconsult.internal/sso-adminserver/sdk/vsphere.local

  URL: https://psc-ha-vip.sbcpureconsult.internal/sso-adminserver/sdk/vsphere.local

  URL: https://psc-ha-vip.sbcpureconsult.internal/websso/SAML2/Metadata/vsphere.local

  URL: https://psc-ha-vip.sbcpureconsult.internal/websso/HealthStatus

  URL: https://psc-ha-vip.sbcpureconsult.internal/sso-adminserver/idp

  URL: https://psc-ha-vip.sbcpureconsult.internal/openidconnect/vsphere.local/.well-known/openid-configurat...

  URL: https://psc-ha-vip.sbcpureconsult.internal/idm

  URL: https://psc-ha-vip.sbcpureconsult.internal/sso-adminserver/sdk/vsphere.local

  URL: https://psc-ha-vip.sbcpureconsult.internal/sso-adminserver/sdk/vsphere.local

  URL: https://psc-ha-vip.sbcpureconsult.internal/sso-adminserver/idp

  URL: https://psc-ha-vip.sbcpureconsult.internal/sts/STSService/vsphere.local

  URL: https://psc-ha-vip.sbcpureconsult.internal/websso/HealthStatus

  URL: https://psc-ha-vip.sbcpureconsult.internal/openidconnect/vsphere.local/.well-known/openid-configurat...

  URL: https://psc-ha-vip.sbcpureconsult.internal/websso/SAML2/Metadata/vsphere.local

  URL: https://psc-ha-vip.sbcpureconsult.internal/idm

I've repeated both checks on each node, and the results are identical. I appreciate your assistance so far!

stevewalker74
Enthusiast

Here's the text displayed for the HTTP 400 error when accessing /psc on the load balanced node:

HTTP Status 400 – Bad Request

Type Status Report

Message An error occurred while sending an authentication request to the PSC Single Sign-On server - null

Description The server cannot or will not process the request due to something that is perceived to be a client error (e.g., malformed request syntax, invalid request message framing, or deceptive request routing).

Apache Tomcat/8.5.13

stevewalker74
Enthusiast

Further to the outline above, I have rolled back my lab snapshots and repeated the process of creating new machine certificates and applying them to the PSC nodes. The log files showing the certificate replacement process have been attached and match the process explained by the VMware article quoted above.
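For anyone following along, the 'Updated N service(s)' status lines below come from the standard certificate-manager utility used for the machine SSL replacement; on the appliance it lives at:

/usr/lib/vmware-vmca/bin/certificate-manager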

Interestingly, the number of services updated during the machine cert replacement process is different for each of the two nodes:

Node 1:

Updated 7 service(s)

Status : 100% Completed [All tasks completed successfully]

Node 2:

Updated 10 service(s)

Status : 100% Completed [All tasks completed successfully]

Additionally, at the end of the replacement process you'll see in the logs that the pschealth service does not start correctly on node 1, but it can be restarted manually. Following a reboot the psc-client service is also stopped on node 1 but can again be restarted manually (commands shown below). Prior to the certificate replacement these services were stable. This is the node that was created by running the appliance installer in migration mode from an existing Windows vCenter Server 5.5 SSO instance.
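For reference, the stopped services can be checked and started again with the standard service-control tool (the service names below are the ones reported on my appliances; confirm yours with --status --all first):

service-control --status --all

service-control --start pschealth

service-control --start psc-client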

Once the certificate replacement process is complete and the psc-client/pschealth services are started manually I'm able to log in independently to both SSO nodes using the /psc URL. At this point I'm wondering whether the psc-client and pschealth services arbitrate between active nodes in a site to decide which one will serve the /psc URL?

I will continue with the updateSSOConfig.py and UpdateLsEndpoint.py scripts in the lab to see if I can once again reproduce the behaviour explained at the beginning of this post, but it has now happened twice, in both the lab and the customer environment, so I think it's reproducible.

stevewalker74
Enthusiast

Good news: I was able to roll back my lab and re-run the updateSSOConfig.py and UpdateLsEndpoint.py scripts, only to find that the /psc URL did indeed load successfully on both nodes with the NetScaler load balancing in place. So at least I know that the correct behaviour is that you should be able to open /psc on both appliances.

By examining my snapshots at different stages I have now been able to identify a difference between the original migration node and the clean appliance:

When you run the updateSSOConfig.py Python script to repoint the SSO URL to the load balanced address, it reports that hostname.txt and server.xml are modified:

# python updateSSOConfig.py --lb-fqdn=psc-ha-vip.sbcpureconsult.internal

script version:1.1.0

executing vmafd-cli command

Modifying hostname.txt

modifying server.xml

Executing StopService --all

Executing StartService --all

I was able to locate hostname.txt files (containing the load balancer address) in:

/etc/vmware/service-state/vmidentity/hostname.txt

/etc/vmware-sso/keys/hostname.txt (missing on node 2, but contained the local name on node 1)

/etc/vmware-sso/hostname.txt

Why is this second file missing on node 2? I guess that it is used transiently during the script execution in order to inject the correct value into the server.xml file.
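To compare the two nodes I simply checked which of the three copies exist and what they contain, e.g. with a plain shell loop:

for f in /etc/vmware/service-state/vmidentity/hostname.txt /etc/vmware-sso/keys/hostname.txt /etc/vmware-sso/hostname.txt; do echo "== $f"; cat "$f" 2>/dev/null || echo "(missing)"; done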

The server.xml file is located at:

/usr/lib/vmware-sso/vmware-sts/conf/server.xml

My faulty node contained the following certificate entries under the connector definition:

..store="STS_INTERNAL_SSL_CERT"

certificateKeystoreFile="STS_INTERNAL_SSL_CERT"..

My working node contained:

..store="MACHINE_SSL_CERT"

certificateKeystoreFile="MACHINE_SSL_CERT"..
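A quick way to check which keystore the connector on a given node references, without opening the file, is a simple grep:

grep -o 'store="[^"]*"' /usr/lib/vmware-sso/vmware-sts/conf/server.xml

grep -o 'certificateKeystoreFile="[^"]*"' /usr/lib/vmware-sso/vmware-sts/conf/server.xml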

So I was able to simply copy the server.xml file from the working node (overwriting the original on the faulty node) and also remove the /etc/vmware-sso/keys/hostname.txt file to match the configuration. Following a reboot my first SSO node now responds correctly by redirecting https://hosso01.sbcpureconsult.internal/psc to https://psc-ha-vip.sbcpureconsult.internal/websso to obtain its SAML token before ultimately displaying the PSC client UI.
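For completeness, the fix boiled down to something like the following, run as root on the faulty node (this assumes SSH/scp is enabled between the appliances, and it's worth keeping a snapshot or a backup copy of server.xml first):

cp /usr/lib/vmware-sso/vmware-sts/conf/server.xml /usr/lib/vmware-sso/vmware-sts/conf/server.xml.bak

scp root@hosso02.sbcpureconsult.internal:/usr/lib/vmware-sso/vmware-sts/conf/server.xml /usr/lib/vmware-sso/vmware-sts/conf/server.xml

rm /etc/vmware-sso/keys/hostname.txt

reboot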

As a follow-up, by examining the STS_INTERNAL_SSL_CERT store I can see that it was issued by the original Windows vCenter Server 5.5 SSO CA to the subject name:

ssoserver,dc=vsphere,dc=local
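This is the same vecs-cli syntax as the MACHINE_SSL_CERT check earlier in the thread, just pointed at the other store; vecs-cli store list also shows whether the store exists at all on a given node:

/usr/lib/vmware-vmafd/bin/vecs-cli store list

/usr/lib/vmware-vmafd/bin/vecs-cli entry list --store STS_INTERNAL_SSL_CERT --text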

This store is not present on the other node, and so the correct load balancing certificate replacement must somehow be omitted by one of the upgrade scripts when this scenario occurs (5.5 SSO to 6.5 PSC).

Hope that helps someone else one day!
