Hello everyone,
We recently had a company install a new 3-node vSphere environment, which initially seemed to go well. However, I noticed that some Veeam jobs fail when the VM being backed up is on one specific ESXi node, while everything works fine on the other two. Initially I thought this was a Veeam issue, but digging into the Veeam logs I found the following error:
[04.07.2019 12:24:54] < 3040> vdl| WARN|[vddk] [NFC ERROR] NfcNewAuthdConnectionEx: Failed to connect: The remote host certificate has these problems:
[04.07.2019 12:24:54] < 3040> vdl| WARN|[vddk]
[04.07.2019 12:24:54] < 3040> vdl| WARN|[vddk] * A certificate in the host's chain is based on an untrusted root.
This pointed me towards the issue being with the ESXi server. I dug into /var/log/vmauthd.log on the affected ESXi server and found the following:
2019-07-08T13:47:44Z vmauthd[2108903]: lib/ssl: OpenSSL using FIPS_drbg for RAND
2019-07-08T13:47:44Z vmauthd[2108903]: lib/ssl: protocol list tls1.2
2019-07-08T13:47:44Z vmauthd[2108903]: lib/ssl: protocol list tls1.2 (openssl flags 0x17000000)
2019-07-08T13:47:44Z vmauthd[2108903]: lib/ssl: cipher list ECDHE+AESGCM:RSA+AESGCM:ECDHE+AES:RSA+AES
2019-07-08T13:47:44Z vmauthd[2108903]: lib/ssl: curves list prime256v1:secp384r1:secp521r1
2019-07-08T13:47:44Z vmauthd[2108903]: Connect from remote socket (172.18.4.53:61252).
2019-07-08T13:47:44Z vmauthd[2108903]: Connect from 172.18.4.53
2019-07-08T13:47:44Z vmauthd[2108903]: SSL Error: error:14094418:SSL routines:ssl3_read_bytes:tlsv1 alert unknown ca
2019-07-08T13:47:44Z vmauthd[2108903]: recv() FAIL: 1.
2019-07-08T13:47:44Z vmauthd[2108903]: VMAuthdSocketRead: read failed. Closing socket for reading.
2019-07-08T13:47:44Z vmauthd[2108903]: Read failed.
This looks like an issue with the certificate authority on the affected host, which would tie in nicely with the error I am seeing in Veeam. So I compared the CA store on one of the working nodes with the one on the non-working node using this command:
openssl crl2pkcs7 -nocrl -certfile /etc/vmware/ssl/castore.pem | openssl pkcs7 -print_certs -noout
Working ESXi node
subject=/CN=CA/DC=vsphere/DC=local/C=US/ST=California/O=DWLAN-VCA01.brand.local/OU=VMware Engineering
issuer=/CN=CA/DC=vsphere/DC=local/C=US/ST=California/O=DWLAN-VCA01.brand.local/OU=VMware Engineering
subject=/O=VMware/CN=SMS-190614111842368
issuer=/O=VMware/CN=SMS-190614111842368
Non-working ESXi node
subject=/CN=CA/DC=vsphere/DC=local/C=US/ST=California/O=DWLAN-VCA01.brand.local/OU=VMware Engineering
issuer=/CN=CA/DC=vsphere/DC=local/C=US/ST=California/O=DWLAN-VCA01.brand.local/OU=VMware Engineering
subject=/O=VMware/CN=SMS-190614111842368
issuer=/O=VMware/CN=SMS-190614111842368
And they are identical. I'm not sure how to move forward from here. Can anyone help at all?
Thanks in advance. Frank
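A note on the comparison above: as far as I know, castore.pem holds the CAs the host trusts, while the certificate the host actually presents to clients is /etc/vmware/ssl/rui.crt. Matching subject and issuer lines also do not prove that two certificates are the same. A minimal sketch for comparing the presented certificate across the three nodes follows; `show_cert` is only an illustrative helper name, not a VMware tool.

```shell
# show_cert is a hypothetical helper, not part of ESXi.
# It prints the identifying fields of a single PEM certificate:
# subject, issuer, validity dates and SHA-256 fingerprint.
show_cert() {
    openssl x509 -in "$1" -noout -subject -issuer -dates -fingerprint -sha256
}

# On each ESXi host (default location of the host certificate):
#   show_cert /etc/vmware/ssl/rui.crt
```

If the issuer or validity dates on the failing node differ from those on the two working nodes, that host's certificate is the likely culprit.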
Are these hosts connected to a vCenter Server? If so, regenerate all certs from the vSphere Client and try again.
Thanks for the quick response. Yes they are connected to a vCenter server.
Is there any risk in doing this, and is this the guide you would follow?
There shouldn't be any risk if using the VMCA to issue certs. As long as everything talks through vCenter and not the ESXi hosts directly, you're fine (and even then you'll just have to accept the new cert). Depending on your client, you can just right-click a host and go to (in the Flex client) Certificates > Refresh certificates.
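Before and after a refresh, it can help to look at the certificate a host presents over the network, which is closer to what the backup proxy sees. A rough sketch, assuming openssl is available on the machine you run it from and that the host serves the same certificate on port 443 as it uses for NFC on port 902 (I believe it does, but treat that as an assumption). Port 902 sends a plain-text banner before TLS starts, so pointing s_client straight at it is unlikely to work. `remote_cert` and the host name are placeholders.

```shell
# remote_cert is a hypothetical helper: fetch the certificate a server
# presents during the TLS handshake and print its identifying fields.
# Usage: remote_cert <host> [port]   (port defaults to 443)
remote_cert() {
    host="$1"
    port="${2:-443}"
    # </dev/null makes s_client exit as soon as the handshake is done.
    openssl s_client -connect "$host:$port" -servername "$host" </dev/null 2>/dev/null |
        openssl x509 -noout -subject -issuer -fingerprint -sha256
}

# Example (placeholder host name):
#   remote_cert esxi03.example.local
```

Running this against all three nodes should show whether the failing node is issued by a different CA than the other two.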
Thanks, let me see how I get on with this and I will report back to you.
Frank
It seems we are using our own certificate for this, which has made me very reluctant to issue new SSL certificates for all the hosts. I don't know enough about how SSL is used in vSphere, and I'm concerned that I may negatively affect the live hosts.
Does anyone know a way that I could fix this one host?
Based on the cert contents you posted, you are not using custom certificates for the ESXi hosts although you may be using custom certs for the machine cert of the vCenter. If you would like to open a support case with both Veeam and VMware, you're welcome to proceed.
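One way to check this on the host itself is to test whether the host certificate chains to a CA in the local store. A sketch, assuming the default ESXi paths; `verify_against` is an illustrative name.

```shell
# verify_against is a hypothetical helper.
# Usage: verify_against <ca-bundle.pem> <cert.pem>
# Exits 0 and prints "<cert.pem>: OK" when the certificate chains
# to a CA in the bundle; exits non-zero otherwise.
verify_against() {
    openssl verify -CAfile "$1" "$2"
}

# On an ESXi host:
#   verify_against /etc/vmware/ssl/castore.pem /etc/vmware/ssl/rui.crt
```

This only reflects the host's own view. The client side may trust a different set of roots, so a passing result here does not rule out a trust problem on the backup server.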
I have already raised a ticket with Veeam, and their response was "this is a VMware issue", which I think is fair enough. I believe that your solution is probably the right one. I'm just reluctant to make this change, as I don't really understand what this certificate is doing, and therefore the effect of reissuing it to the live nodes.
I have kicked it back to the consulting company that built the cluster for us. Hopefully they can resolve it.
