VMware Cloud Community
GotToBeStrong
Enthusiast

Catch-up Upgrade/Migration - Advice/Guidance Needed

Good day. I'm facing a catch-up upgrade/migration of our corporate office data center VMware cluster and vCenter Server. To the credit of the vSphere product family, our current production stack has been running vSphere 6.0 for 10 years without issue. I'm now finally able to upgrade the hardware and roll the environment forward to 8.0; however, I have a (hopefully small) issue on which I'm looking for guidance or advice. I've been managing this environment since I built it on ESXi 2.x, so I'm very familiar with the vast majority of what I'm dealing with, but I've run into a problem that even I can't figure out, and it is making me hesitate to jump into this upgrade.

(Background): Due to budgetary circumstances beyond our control we were forced to let our VMware support licensing lapse just before 6.5 was released (which forced a license migration), so we were unable to step forward from 6.0 for quite a while. During that period we ran into the not uncommon circumstance of the STS/solution certificates expiring on our Windows vCenter Server. We were able to renew the certificates and resurrect the vCenter services; however, somehow the certificate for the web client was never replaced. When I navigate to the web client hosted by our Windows vCenter Server I receive a 500 (null) Server Error, and if I check the certificate presented by the vCenter web server it is (edit after double checking) a currently valid certificate. (See new info at end.)

In addition to this the virgo log has entries such as:
c.v.v.s.c.impl.SecurityTokenServiceImpl$RequestResponseProcessor                  Provided credentials are not valid
and
com.vmware.vise.vim.security.sso.impl.NgcSolutionUser                       Login as solution user failed
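
To see whether these two errors keep recurring after each certificate change, a copy of the log can be scanned for them. The following is only a minimal sketch: the function names are made up for illustration, and the log location is left as a parameter because it is not stated here.

```python
from collections import Counter
from pathlib import Path

# The two message fragments quoted above from the virgo log.
PATTERNS = (
    "Provided credentials are not valid",
    "Login as solution user failed",
)


def count_sso_errors(lines):
    """Count how many lines contain each known error fragment."""
    counts = Counter()
    for line in lines:
        for pattern in PATTERNS:
            if pattern in line:
                counts[pattern] += 1
    return counts


def scan_log(path):
    """Scan one log file; `path` is wherever your copy of the virgo log lives."""
    with Path(path).open(encoding="utf-8", errors="replace") as handle:
        return count_sso_errors(handle)
```

Running `scan_log` before and after a fix gives a quick indication of whether the solution user login failures have stopped, without reading the whole log by hand.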

I had never noticed the web site issue before because I've been using the old Java client forever and haven't needed the web interface (which was clunky and incomplete in version 6.0 anyway). During this same timespan Flash and IE have been sunset, so even if I could resurrect the web client site I probably couldn't use it, because I wouldn't be able to get a browser to run Flash. Additionally, the virgo log errors above make me think that there may still be a buried certificate issue, if the user credentials in question are certificate based and the certificate is expired (I'm unsure, so this is a guess).

That said, the first step in my migration plan must be to upgrade vCenter from 6.0 to 6.7, which will allow me to begin stepping the remainder of our environment forward. I now have current licensing (which I can downgrade, or I can simply use the trial period during the migration until I have everything up to 7/8), so that's not an issue; however, anything vSphere 6.x is EoL and past End of Technical Guidance, so I'm not sure support will help me here. My hesitation comes from this certificate issue with the web client site, plus a previous attempt to run a vCenter upgrade installation from ISO on this same Windows vCenter Server: the in-place upgrade failed, citing an invalid certificate, and would not continue. I assumed at the time it was one of the STS/solution user certificates, but I was never able to verify exactly which certificate was tripping up the upgrade installer. On top of that, the last time I had to replace the STS certificates I could not use the built-in certificate manager wizard; for some reason the automatic workflow would fail at the end regardless of anything else. I ended up having to remove and replace every single certificate individually from the command line, including the CA certs. All STS certificates presently installed (and working for the services) are internal PKI certificates and valid for the next few years. No expired certificates exist in any of the STS certificate stores; however, an expired certificate is being presented by the web client site.

So what I'd like to do (and here is where I need some advice/guidance) is skip the in-place upgrade and migrate to a newer version (6.7U3) VCSA, essentially absorbing my existing vCenter database so that, to my hosts, it looks like little more than a reboot of vCenter (or a minimal disruption). We also use Veeam for backups, so I would probably need to re-attach that; however, that is more than likely trivial, as I've already had to correct it after replacing the solution certificates.

I've seen walk-throughs of the migration process and it seems rather straightforward; what still makes me hesitate is that pesky certificate issue, along with the possibility that a certificate problem is causing the other errors. I'm unsure whether a migration will fail because of this, and whether the migration wizard checks more than just the STS certificates and would notify me of such a deep problem if one existed. I'm also curious whether, after migration, the services (daemons) running on VCSA use the same service user login logic as the Windows platform does, or whether those Windows service users can be retired. If they are abandoned with VCSA, I will feel better about the chances of nothing going awry, apart from the buried, expired cert.

To further compound the situation, our global backups use Veeam, which relies on vCenter Server; if vCenter isn't working Veeam is out to lunch, so taking a backup or a snapshot would be problematic at best. I want to puzzle this through 'on paper', as it were, with someone who has dealt with this (or a similar) scenario before. There has got to be something that I'm not thinking of, for good or for bad, that I need to take into consideration, and I'm hoping for some of that insight from the community.

Has anyone ever run into this type of scenario before, migrating away from a vCenter server that has had and potentially still has expired certificates or other certificate related problems?

(ADDITIONAL INFORMATION):
I attempted to run the migration wizard this morning and have found a little more granular information about my certificate problem:

The PowerShell command to review all of the (VCInstall Store list) certificates returns only valid certificates: each is valid through a future date, so not expired. As previously mentioned, all certificates are internal PKI signed certificates; none of this is public facing, so I had no need for public certificates. However, the migration wizard's pre-migration check fails citing an expired certificate. A quick Google search leads me to VMware KB 68155, which is essentially [replace the STS_INTERNAL_SSL_CERT with the machine cert from the MACHINE_SSL_CERT store].

Step 1 in this article has me using OpenSSL to connect to localhost:7444 and retrieve the currently published certificate.  The certificate presented by this web service IS INDEED EXPIRED and not one of the certificates returned by the powershell script above.  Furthermore the same article outlines steps on backing up the Machine_Cert and STS_Internal_SSL_Cert then replacing the internal SSL cert with the Machine cert.  When backing up the STS_INTERNAL_SSL_CERT - if I review this cert - what is exported by the backup command is a valid cert and not the same that is presented by the web service on port 7444.
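
The same check (fetch whatever certificate the listener on port 7444 presents and look at its end date) can be sketched in Python. This is an illustration under stated assumptions, not the KB's procedure: the function names are hypothetical, the `openssl` CLI is assumed to be on the PATH, and a very old service that only speaks early TLS versions may refuse the handshake from a modern Python build.

```python
import ssl
import subprocess
import time


def fetch_pem(host="localhost", port=7444):
    """Fetch the certificate a TLS listener presents, without validating it."""
    return ssl.get_server_certificate((host, port))


def read_enddate(pem):
    """Ask the openssl CLI (assumed to be installed) for the cert's end date.

    Returns a line such as 'notAfter=Jan  5 09:34:43 2018 GMT'.
    """
    result = subprocess.run(
        ["openssl", "x509", "-noout", "-enddate"],
        input=pem, capture_output=True, text=True, check=True,
    )
    return result.stdout


def not_after_seconds(enddate_line):
    """Convert a 'notAfter=...' line to a Unix timestamp (UTC)."""
    _, _, value = enddate_line.strip().partition("=")
    return ssl.cert_time_to_seconds(value)


def is_expired(enddate_line, now=None):
    """True if the notAfter date lies in the past relative to `now`."""
    now = time.time() if now is None else now
    return not_after_seconds(enddate_line) < now


# Example use on the vCenter server itself (performs a network connection):
#   print(is_expired(read_enddate(fetch_pem("localhost", 7444))))
```

Keeping the date parsing separate from the network fetch means the same `is_expired` check can be pointed at end-date lines taken from exported store certificates as well.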

So this may be my underlying problem - I finally found an expired certificate buried in my deployment somewhere, however I have no immediate idea how to approach changing this particular certificate as it seems to be an artifact and was not replaced when I updated all of the Service Solution certificates a few months ago.
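
One way to confirm that the certificate served on the port and the certificate exported from the store really are different objects is to compare thumbprints. A minimal sketch, assuming both certificates are available as bare PEM text (just the BEGIN/END CERTIFICATE block); the function names are hypothetical:

```python
import hashlib
import ssl


def thumbprint(pem, algorithm="sha1"):
    """Hex digest of the DER bytes inside a single PEM certificate block."""
    der = ssl.PEM_cert_to_DER_cert(pem.strip())
    return hashlib.new(algorithm, der).hexdigest()


def same_certificate(pem_a, pem_b):
    """True if both PEM blocks encode byte-identical certificates."""
    return thumbprint(pem_a) == thumbprint(pem_b)
```

If the thumbprint of the served certificate matches none of the exported store entries, that supports the theory that the service is reading its certificate from somewhere other than those stores.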

So a simpler question might be: how do I replace the specific certificate that is being presented by the web service on port 7444?

Please advise. Thank you,

4 Replies
GotToBeStrong
Enthusiast

Apologies if that inquiry was a little drawn out.  I've been working on this and have managed to make some progress.  I found and replaced one expired certificate using this KB:  [Replacing the Lookup Service SSL certificate on a Platform Services Controller 6.0 (2118939)].
Now it seems as if I still have another orphaned certificate somewhere, as I'm still finding errors citing bad credentials in the log files for the vAPI and web services. The fix above replaced the certificate in the Lookup Service; to boil down the entire KB article, I essentially had to replace a .p12 certificate file in the config directory for LS and restart the services.

As mentioned, I found another (or a copy) of the expired certificate in the CFG directory for SSO (programdata\vmware\vcenterserver\cfg\sso\keys).  The same set of files (p12 and accompanying keys) live in this directory as the lookup service directory that I just fixed with the KB above.  As such, I'm assuming that the certificate for this service (SSO) may be manipulated the same as the other service - (replace file, restart service).

Question is:  Can I simply replace the .P12 certificate file (and/or key files) in this directory (cfg\sso\keys) with the same certificate that I just generated in the above KB for the lookup service (and restart the service(s) to implement)?
It is an internal PKI signed cert and for all intents and purposes should work for this service as well as it did for the lookup service.  There was nothing specific about the lookup service built into the cert.  The cert is essentially the same as the machine cert, just the server FQDN and signed by our internal CA.  For what it's worth, the certificate was signed using the certificate template that we configured for setting up the whole SSO platform back when vSphere version 5 required it.
What I don't know is this: if I go ahead and replace the P12 file (and key files) in the cfg\sso\keys directory and restart the services again, and something goes wrong or this doesn't bring the last few services back online as intended, what are the chances that it will invalidate my good, working configuration in place today in a way that is unrecoverable or difficult to reverse?
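
Whatever the answer, the file swap itself can be made reversible by copying the whole keys directory aside first. The following is a minimal sketch with hypothetical function names; it only protects the files on disk, and it assumes the relevant services are stopped while files are copied or restored. It does not undo any state a service may have written elsewhere.

```python
import shutil
import time
from pathlib import Path


def backup_dir(source, backup_root):
    """Copy `source` to a timestamped folder under `backup_root`; return the copy."""
    source = Path(source)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"{source.name}-{stamp}"
    shutil.copytree(source, target)
    return target


def restore_dir(backup, destination):
    """Replace `destination` with the contents of a previous backup."""
    destination = Path(destination)
    if destination.exists():
        shutil.rmtree(destination)
    shutil.copytree(backup, destination)
```

Pointing `backup_dir` at the cfg\sso\keys directory before touching the .p12 file means the original files can be put back with `restore_dir` and the services restarted if the replacement does not behave as hoped.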

I hesitate to just blindly make changes in this scenario. If our vCenter server goes offline or becomes inoperable for any reason I'm going to have a bigger problem on my hands, which, given that this repair is step one of a major migration/upgrade, is the last thing I need...

I'm hoping someone with some knowledge of the inner workings of this service can tell me whether replacing this certificate in this manner is OK and/or whether it can cause any problems. The short and sweet of the KB article was basically swap file and restart service, and this directory looks the same, with only a few certificate files in it. So I'm hoping it is a safe assumption that the fix (replacing the certificate) is essentially the same. Good file in correct location = working service.

Please advise, 

Thank you,

GotToBeStrong
Enthusiast

So the migration assistant prechecks succeeded and I attempted a migration yesterday, which blew up nicely about 90% of the way through the first-boot scripts. Ultimately the failure was in the VMware Identity Service firstboot scripts on the new appliance. From the logs the GUI exported for the process:
(vmidentity-firstboot.py_10409_stdout.log):  Failed to add STS SSL certificate to VECS.

I have since found and run the script ls_ssltrust_fixer_p3.py, which essentially recreates the trust anchors for all of the identity services; I believe we had 28 items repaired by this tool. After the script fixed the trust anchors I did not observe any changes in the behavior of the server. vCenter is functional, although while some of the ancillary web pages (/websso/healthstatus) return XML (<status>green</status>), others return errors such as 'Unexpected EOF at prolog'... All pages have good SSL as far as the browser is concerned; we are using internal PKI/certs for all of this.
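
To keep track of which of these pages answer sensibly, the response bodies can be classified with a few lines of Python. This is a sketch only: the function name is hypothetical, it assumes an un-namespaced `<status>` element like the one quoted above, and fetching the pages is left out.

```python
import xml.etree.ElementTree as ET


def read_status(body):
    """Return the text of the first <status> element, or None if there is none.

    None is also returned when the body is not well-formed XML (for example
    an empty response or an HTML error page).
    """
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return None
    node = root if root.tag == "status" else root.find(".//status")
    if node is None or not node.text:
        return None
    return node.text.strip()
```

A page that yields "green" is healthy as far as that endpoint reports; a `None` result marks the pages that returned something unparseable and deserve a closer look in the service logs.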

I have no indication that this will make any difference with the migration.  The fix was related to the VMware Identity Services and the migration failure was citing the Security Token Service.  I'm going to investigate the STS on the source a bit more in depth.

Narrowing it down I hope but still would like some insight.

GotToBeStrong
Enthusiast

So, I found another buried expired certificate in the STS service. I found and followed this KB (Generate a New STS Signing Certificate on a vCenter Windows Installation (vmware.com)) to replace said certificate; however, I now run into a chicken/egg scenario where the next step in the process is this KB (Refresh the Security Token Service Certificate (vmware.com)), which requires the web UI, and I don't have a working web UI. I have a [500] SSO error: null message. The second conundrum is that the version of the web site that should be working here is (if I recall) Flash based, which isn't working anymore, so even without the 500 error I would run into that.

So, I'm hoping there is a command line way to accomplish what this KB article wants me to do via the web site, which is to upload the JKS file I created, containing the new certificates, to somewhere. The web site is only a pretty front end for API execution anyway. The end result still has to be that this .jks file lives somewhere in the file system and has pointers to it in properties/config files. Even if it were more complicated than that, say the import brings it into the database as a binary object or something, I'm quite familiar with SQL and can accomplish that easily enough if needed.

This site (https://VCSERVER.LOCAL.DOMAIN/sts/STSService/vsphere.local) works, perhaps there is a way to construct a URL that will accomplish what the web UI would.

NateNateNAte
Hot Shot

tldr

I appreciate that you're providing a ridiculous amount of detail (yay!) but at the same time...there's too much for me to read through 4 posts.  

If I can hopefully summarize (the problem from the initial post), you want to migrate from version 6 to 8.  There are support paths.  But it looks like you're also running into some certificate and back-up/operations issues. 

To protect your production environment and maintain availability, I would recommend a 'leapfrog' upgrade: build a side environment with the 6.7 vCSA and some appropriately versioned ESXi hosts where you can 'store' VMs as you make the migration push from the current v6 to a 'new' and ready v8 environment.

The side environment is just a parking lot for production VMs as you upgrade them and move them to v8. That environment doesn't have to be very large. The v8 environment should start small (assuming you have resources available to cut away and make those 'new' v8 resources). As you migrate VMs out of the v6 environment, consolidate hosts, then move the 'empty/available' hosts to the v8 environment and add them as new resources there. As your v6 environment shrinks, your v8 should grow, until the v6 environment is just the vCenter Server and you can decommission that.

I know it's probably not the answer you're looking for, but you need to plan out the effort, before you make this kind of upgrade - otherwise you'll continue down the rabbit hole your additional posts seem to be narrating.  
