cornwella
Contributor

VCSA 6.5 - "Failed to start vpxd-svcs, vapi-endpoint services. Error: Operation timed out" after Cert Renewal

We have a single on-premises VCSA 6.5 instance that recently ran into the certificate expiration detailed in this KB:

https://kb.vmware.com/s/article/76719

All the certificates have been regenerated using the certificate-tool via the CLI, and now show up as up-to-date using the one-liner in the above KB (they were all previously expired a week ago):

STORE MACHINE_SSL_CERT
Alias : __MACHINE_CERT
            Not After : Aug 18 19:56:50 2022 GMT
STORE TRUSTED_ROOTS
Alias : 9bd7b30bcb1dcecfe2491a3e91fcd3dd756f347f
            Not After : Aug  1 13:58:01 2028 GMT
Alias : c0af9d76ae9fab214298c6b11d4efb72f64b6c13
            Not After : Aug 13 18:18:55 2030 GMT
Alias : ac50bb369ff7dce7e8c372b9b3e50f6e3aaaa528
            Not After : Aug 13 18:20:03 2030 GMT
Alias : 3e816060d6322a45114eac30798edbf1a4a1397d
            Not After : Aug 13 18:28:26 2030 GMT
Alias : 074ddc83baeea4c6588f3f11837ed4fc77b25220
            Not After : Aug 13 19:21:38 2030 GMT
Alias : 4bbaf83d23a818f2e8122b60ca0edc6dabf76d7d
            Not After : Aug 13 19:33:49 2030 GMT
STORE TRUSTED_ROOT_CRLS
Alias : a45f284d7b9325005381b1b14d3ac3c823e104c9
Alias : 4b3b32cf9bb0d212aa6551bdd97dd3aaf029dde5
Alias : 02c60981250d68d94e1fcd31c93d0c50ae26d531
Alias : c4df908ec94dc3b1b774ca4a8768acfdbee90e59
Alias : f65b7ab274c5d949e8e914101797260d9e40fd70
Alias : 84d8635a51db3a011bab257873555c6776381d37
STORE machine
Alias : machine
            Not After : Aug 18 19:12:42 2022 GMT
STORE vsphere-webclient
Alias : vsphere-webclient
            Not After : Aug 18 19:12:43 2022 GMT
STORE vpxd
Alias : vpxd
            Not After : Aug 18 19:12:43 2022 GMT
STORE vpxd-extension
Alias : vpxd-extension
            Not After : Aug 18 19:12:44 2022 GMT
STORE SMS
Alias : sms_self_signed
            Not After : Aug  7 14:06:21 2028 GMT
STORE BACKUP_STORE
Alias : bkp___MACHINE_CERT
            Not After : Aug 18 19:11:39 2022 GMT
Alias : bkp_machine
            Not After : Aug 18 19:12:42 2022 GMT
Alias : bkp_vsphere-webclient
            Not After : Aug 18 19:12:43 2022 GMT
Alias : bkp_vpxd
            Not After : Aug 18 19:12:43 2022 GMT
Alias : bkp_vpxd-extension
            Not After : Aug 18 19:12:44 2022 GMT
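
For anyone repeating this check, a listing like the one above can also be screened programmatically. The following is a minimal Python sketch and not part of the KB: it assumes the one-liner's output has been saved as text in the STORE / Alias / Not After layout shown in this post, and the function names `parse_cert_listing` and `expiring` are purely illustrative.

```python
from datetime import datetime, timezone


def parse_cert_listing(text):
    """Return (store, alias, not_after) tuples from a STORE/Alias/Not After listing."""
    entries = []
    store = alias = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("STORE "):
            store, alias = line[len("STORE "):].strip(), None
        elif line.startswith("Alias :"):
            alias = line.split(":", 1)[1].strip()
        elif line.startswith("Not After :") and alias is not None:
            stamp = line.split(":", 1)[1].strip()
            # Example stamp: "Aug  1 13:58:01 2028 GMT".
            # split() absorbs the double space used for single-digit days.
            month, day, clock, year, _tz = stamp.split()
            not_after = datetime.strptime(
                f"{month} {day} {clock} {year}", "%b %d %H:%M:%S %Y"
            ).replace(tzinfo=timezone.utc)
            entries.append((store, alias, not_after))
    return entries


def expiring(entries, now, within_days=30):
    """Entries that are already expired or expire within `within_days` of `now`."""
    flagged = []
    for store, alias, not_after in entries:
        days_left = (not_after - now).days
        if days_left < within_days:
            flagged.append((store, alias, days_left))
    return flagged
```

Entries without a "Not After" line (such as the CRL aliases) are skipped. Since the regenerated certificates in this listing expire on Aug 18, 2022, a check like this would start flagging them as that date approaches.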

When I try to start all services now, it returns the following after ~5 minutes:

Service-control failed. Error Failed to start vmon services.vmon-cli RC=1, stderr=Failed to start vpxd-svcs, vapi-endpoint services. Error: Operation timed out

When using service-control to start just the vpxd-svcs service by itself, it returns the following error:

Perform start operation. vmon_profile=None, svc_names=['vmware-vpxd-svcs'], include_coreossvcs=False, include_leafossvcs=False
2020-08-18T21:10:50.484Z   Service vpxd-svcs state STOPPED
Error executing start on service vpxd-svcs. Details {
    "resolution": null,
    "detail": [
        {
            "args": [
                "vpxd-svcs"
            ],
            "id": "install.ciscommon.service.failstart",
            "localized": "An error occurred while starting service 'vpxd-svcs'",
            "translatable": "An error occurred while starting service '%(0)s'"
        }
    ],
    "componentKey": null,
    "problemId": null
}
Service-control failed. Error {
    "resolution": null,
    "detail": [
        {
            "args": [
                "vpxd-svcs"
            ],
            "id": "install.ciscommon.service.failstart",
            "localized": "An error occurred while starting service 'vpxd-svcs'",
            "translatable": "An error occurred while starting service '%(0)s'"
        }
    ],
    "componentKey": null,
    "problemId": null
}
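
The output above mixes plain text with JSON payloads, which makes it awkward to scan when several services fail. As a rough sketch (assuming the output was captured to a string; `extract_json_payloads` and `failure_messages` are made-up helper names, not part of service-control), the embedded objects can be pulled out with Python's standard `json` module:

```python
import json


def extract_json_payloads(output):
    """Yield each top-level JSON object embedded in free-form command output."""
    decoder = json.JSONDecoder()
    pos = output.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at `pos` and
            # returns it together with the index just past its end.
            obj, end = decoder.raw_decode(output, pos)
        except json.JSONDecodeError:
            pos = output.find("{", pos + 1)
            continue
        yield obj
        pos = output.find("{", end)


def failure_messages(output):
    """Collect the 'localized' text from every payload's 'detail' list."""
    messages = []
    for payload in extract_json_payloads(output):
        for item in payload.get("detail", []):
            messages.append(item.get("localized", ""))
    return messages
```

Because the same payload is printed twice in this output, the returned list contains the same message twice; deduplicating it is left to the caller.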

The web UI returns the following 503 error (which it has been returning since the certs expired):

503 Service Unavailable (Failed to connect to endpoint: [N7Vmacore4Http20NamedPipeServiceSpecE:0x000056033c080640] _serverNamespace = / action = Allow _pipeName =/var/run/vmware/vpxd-webserver-pipe)

Can anyone point me to what log files specifically I need to be looking at to diagnose this and figure out what keeps the service from starting? I've already covered the following:

  • It's not a disk space / log rotation issue
  • It's not the PostgreSQL database (for which I found a few threads, but it's starting properly in our instance)

Our last resort is to simply wipe and reinstall VCSA, but I'd like to avoid it if this is possible to fix.

Accepted Solutions
nachogonzalez
Expert

Just to be sure, you are using self-signed certificates, right?

Try this:

VMware Knowledge Base

Use option 4 and follow the steps.
If that fails and rolls back, navigate to the VMware Endpoint Certificate Store at /usr/lib/vmware-vmafd/bin/dir-cli and
create a new folder (mkdir Certs_backup).

Move all the certificates into the new folder and try again with the KB I linked above. That should solve the issue.
Also keep in mind that you should reboot the VCSA at the end.

15 Replies
nachogonzalez
Expert

Hey, hope you are doing fine.

In my experience this is a certificate error; VMware services will fail to start if the certificates aren't replaced correctly.

Can you please attach vpxd.log (located under /var/log/vmware/vpxd/)?
In addition to that, are you running an embedded PSC or an external PSC?


Have you tried this:

Regenerate a New VMCA Root Certificate and Replace All Certificates

Regenerate Self-Signed Certificate in vSphere 6.5 - VMWare Insight --> Try this first

Hope this works

cornwella
Contributor

Thank you for your reply! We run an embedded PSC. I've attached a segment of the vpxd.log; it looks like it still thinks the certs are expired even after regenerating them:

2020-08-18T17:59:12.882Z error vpxd[7FB8023E7700] [Originator@6876 sub=LSClient] Caught exception while creating LS client adapter: N7Vmacore3Ssl18SSLVerifyExceptionE(SSL Exception: Verification parameters:
--> PeerThumbprint: 08:0A:82:91:0D:F4:CC:62:82:27:66:45:69:BD:78:A7:9A:EB:5B:B5
--> ExpectedThumbprint:
--> ExpectedPeerName: 10.83.1.20
--> The remote host certificate has these problems:
-->
--> * certificate has expired)
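
The PeerThumbprint in this log entry is what identifies which certificate vpxd still considers expired. Below is a small Python sketch for pulling those fields out of a vpxd.log excerpt and for computing a comparable thumbprint from a PEM file. It assumes the thumbprint is the colon-separated SHA-1 digest of the DER-encoded certificate (the 20-byte length is consistent with that, but treat it as an assumption), and the helper names are hypothetical.

```python
import hashlib
import re
import ssl

FIELD = re.compile(
    r"^-->\s*(PeerThumbprint|ExpectedThumbprint|ExpectedPeerName):\s*(.*)$"
)
PROBLEM = re.compile(r"^-->\s*\*\s*(.+?)\)?\s*$")


def parse_ssl_verify_errors(log_text):
    """Return one dict per SSLVerifyException block found in the log text."""
    records = []
    current = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if "SSLVerifyException" in line:
            current = {"problems": []}
            records.append(current)
        elif current is not None and line.startswith("-->"):
            field = FIELD.match(line)
            if field:
                current[field.group(1)] = field.group(2).strip()
            else:
                problem = PROBLEM.match(line)
                if problem:
                    current["problems"].append(problem.group(1))
        elif line:
            # Any other non-empty line ends the current exception block.
            current = None
    return records


def sha1_thumbprint(pem_text):
    """Colon-separated, upper-case SHA-1 digest of a PEM certificate's DER bytes."""
    der = ssl.PEM_cert_to_DER_cert(pem_text.strip())
    digest = hashlib.sha1(der).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))
```

Comparing `sha1_thumbprint()` of each exported certificate against the logged PeerThumbprint should show which certificate is still the old one; how the certificates are exported is outside the scope of this sketch.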

I'll try to regenerate the self-signed certificates once more using the articles you linked.

cornwella
Contributor

After following the steps to regenerate the VMCA Root Certificate in this post:

http://vmwareinsight.com/Articles/2020/1/5802978/Regenerate-Self-Signed-Certificate-in-vSphere-6-5

... it gets stuck at 85% upon restarting the affected services and then rolls back, which sounds very similar to what the poster here is describing:

https://communities.vmware.com/thread/565418

The above post includes the following workaround, but in our case the .buildInfo file permissions are already set to 444, so changing them has no effect:

Custom certificate replacement fails on upgraded vCenter Server Appliance 6.5 Update 1

After you upgrade from vCenter Server Appliance 6.5 to 6.5 Update 1 and try to replace the Machine SSL certificate of vCenter Server Appliance, the operation fails because the vSphere Update Manager service cannot access the /etc/vmware/.buildInfo file as the file permission changed from 444 to 640.

Workaround:

  1. Log in as root to the vCenter Server Appliance.
  2. Change the file permission of /etc/vmware/.buildInfo from 640 back to 444 by running the following command:
     chmod 444 /etc/vmware/.buildInfo
  3. Replace the Machine SSL certificate.
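
To confirm whether this workaround even applies, the current mode of the file can be checked before changing anything. A minimal Python sketch using only the standard library follows; `mode_matches` and `describe_mode` are illustrative names, and the path is passed in by the caller rather than hard-coded.

```python
import os
import stat


def mode_matches(path, expected=0o444):
    """True if the permission bits of `path` equal `expected` (default 444)."""
    return stat.S_IMODE(os.stat(path).st_mode) == expected


def describe_mode(path):
    """Permission bits of `path` as an octal string such as '0o640'."""
    return oct(stat.S_IMODE(os.stat(path).st_mode))
```

If `mode_matches("/etc/vmware/.buildInfo")` already returns True, as it did in our case, the chmod step is a no-op and the cause lies elsewhere.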

We use a single host without any organizational custom certificate requirements. I'm kind of at a loss since this should be a straightforward procedure.

cornwella
Contributor

Found out I can still get into the PSC web portal (not the main VCSA one, which returns the aforementioned 503 error).

It shows all certificates as being valid and current, so replacing the certs did work to some extent.

Screen Shot 2020-08-19 at 12.27.33 PM.png

Screen Shot 2020-08-19 at 12.27.42 PM.png

Screen Shot 2020-08-19 at 12.27.58 PM.png

The error log in the certificate manager has me thinking this is related to the Update Manager:

Service-control failed. Error Failed to start vmon services.vmon-cli RC=2, stderr=Failed to start updatemgr services. Error: Service crashed while starting

2020-08-19T16:28:19.396Z ERROR certificate-manager None

2020-08-19T16:28:19.397Z ERROR certificate-manager Error while starting services, please see log for more details

This matches the Update Manager issue in this KB, but I can't stop the service as I can't log into the VCSA web interface to turn it off:

https://kb.vmware.com/s/article/2150895
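
Since the failing service changed between attempts (vpxd-svcs and vapi-endpoint earlier, updatemgr here), it can help to extract the service names and the reason from each "Service-control failed" line and compare runs. The regular expression below is a sketch based only on the two error lines quoted in this thread; other message formats may exist, and `failed_services` is a made-up helper name.

```python
import re

# Service names are limited to word characters, commas, spaces and hyphens so
# that the match cannot start at the earlier "Failed to start vmon services."
# fragment that precedes "stderr=" on the same line.
PATTERN = re.compile(
    r"Failed to start (?P<services>[\w, -]+?) services\. Error: (?P<reason>.+)$"
)


def failed_services(line):
    """Return (service_names, reason) from a service-control error line."""
    match = PATTERN.search(line.strip())
    if not match:
        return [], None
    names = [n.strip() for n in match.group("services").split(",") if n.strip()]
    return names, match.group("reason").strip()
```

Lines that do not contain the pattern yield an empty list and `None`, so the function can be run over a whole log without pre-filtering.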

cornwella
Contributor

I have no clue what changed between the two cert resets, but the VCSA web portal is now working again and the service is starting, even though the last Machine cert reset failed and attempted to roll back the changes. I'll chalk this one up to ghosts. If anyone else runs across the same issue, let me know! I'll mark this as solved.

nachogonzalez
Expert

Glad this worked

Thogus
Contributor

Hi Nacho,

We have the same problem here, but I don't understand your solution.

What does

navigate to VMware Endpoint certificate store /usr/lib/vmware-vmafd/bin/dir-cli

mean?

"/usr/lib/vmware-vmafd/bin/dir-cli" isn't a directory in which I can create a new directory. Or did you mean that I have to use this tool to move the certificates? Can you explain it in a little more detail?

Big thanks and kind regards

Thomas

nachogonzalez
Expert

Hey, hope you are doing fine.
Are you using a Windows-based vCenter or the VCSA (Linux-based)?

Thogus
Contributor

Hi, yes, I hope you are doing fine too.

We are using the Linux-based VCSA, version 6.5.0.22000.

nachogonzalez
Expert

If you log in as root via SSH,
are you able to do this?

cd /usr/lib/vmware-vmafd/bin/dir-cli

Thogus
Contributor

No, because it isn't a directory on our side:

root@esxvcenter01 [ ~ ]# cd /usr/lib/vmware-vmafd/bin/dir-cli

bash: cd: /usr/lib/vmware-vmafd/bin/dir-cli: Not a directory

nachogonzalez
Expert

Hey, I looked at some notes.
dir-cli is a certificate management tool (a binary, not a directory) that will help you regenerate the solution users' certificates.

This will help you:

https://www.settlersoman.com/how-to-publish-root-ca-into-the-trusted-store-in-vmware-endpoint-certif...

bnahuy
Contributor

I tried this KB first.

https://kb.vmware.com/s/article/76719

I used this guide to copy the file to vCenter:

https://techbrainblog.com/2015/03/30/how-to-scp-files-to-vmware-vcenter-appliance-6-0-vcsa/

But it got stuck with the error "Failed to start vpxd-svcs, vapi-endpoint services. Error: Operation timed out".

Then I tried your suggestion, and vCenter can now be accessed.

I'm pretty sure this issue came from an expired STS certificate.

JohnnyWayneJr
Contributor

Hi bnahuy, 

What suggestion specifically did you try?

I have regenerated the cert as well, but it is still failing to start services, the same as the original poster described:

JohnnyWayneJr_0-1620221113820.png