VMware Cloud Community
philzy
Enthusiast

Problem with VSAN Health Check Plugin installation

Hi!

After running /usr/lib/vmware-vpx/vsan-health/health-rpm-post-install.sh, I get this output:

/usr/lib/vmware-vpx/vsan-health/health-rpm-post-install.sh --force

/usr/lib/vmware-vpx/workflow/bin

2015-05-15T21:32:05.625Z   Getting value for install-parameter: workflow.int.service-port

2015-05-15T21:32:05.633Z   Getting value for install-parameter: workflow.int.jmx-port

2015-05-15T21:32:05.643Z   Getting value for install-parameter: vpxd.int.sdk-port

2015-05-15T21:32:05.650Z   Getting value for install-parameter: vpxd.int.sdk-tunnel-port

2015-05-15T21:32:05.658Z   Getting value for install-parameter: rhttpproxy.ext.port1

2015-05-15T21:32:05.665Z   Getting value for install-parameter: rhttpproxy.ext.port2

{'vpxd_sdk_tunnel_port': '8089', 'rhttpproxy_https_port': '443', 'rhttpproxy_http_port': '80', 'workflow_service_port': '8088', 'vpxd_sdk_port': '8085', 'PASSWORD': '', 'workflow_jmx_port': '19999'}

2015-05-15T21:32:05.673Z   Getting value for install-parameter: syslog.ext.port

2015-05-15T21:32:05.682Z   Getting value for install-parameter: vc.home.path

2015-05-15T21:32:05.690Z   Getting value for install-parameter: vc.conf.path

2015-05-15T21:32:05.691Z   VSAN Health service firstboot started

2015-05-15T21:32:05.702Z   User %s already exists, skipping creation.

2015-05-15T21:32:05.710Z   Getting value for install-parameter: rhttpproxy.cert

2015-05-15T21:32:05.710Z   WARNING Value for install-parameter rhttpproxy.cert is empty

Traceback (most recent call last):

  File "/usr/lib/vmware-vpx/firstboot/vsanhealth_firstboot.py", line 292, in Main

    res = vsanhealth_fb.get_rp_cert_info()

  File "/usr/lib/vmware/site-packages/cis/firstboot.py", line 185, in get_rp_cert_info

    thumbprint, ssl_trust, crt = get_certinfo(rp_cert_file)

  File "/usr/lib/vmware/site-packages/cis/tools.py", line 184, in get_certinfo

    f.readFile(cert_file)

  File "/usr/lib/vmware/site-packages/cis/utils.py", line 1028, in readFile

    loErrMsg = localizedString(errMsg, file_name, e)

TypeError: localizedString() takes at most 2 arguments (3 given)

2015-05-15T21:32:05.712Z   VSAN Health firstboot failed

Traceback (most recent call last):

  File "/usr/lib/vmware-vpx/firstboot/vsanhealth_firstboot.py", line 343, in <module>

    Main()

  File "/usr/lib/vmware-vpx/firstboot/vsanhealth_firstboot.py", line 333, in Main

    if eInfo and eInfo.detail:

UnboundLocalError: local variable 'eInfo' referenced before assignment

vmware-vpxd: Stopping vpxd by administrative request. process id was 9301

success

vmware-vpxd: VC SSL Certificate does not exist, it will be generated by vpxd

Waiting for the embedded database to start up: success

Executing pre-startup scripts...

vmware-vpxd: Starting vpxd by administrative request.

success

vmware-vpxd: Waiting for vpxd to start listening for requests on 8089

Waiting for vpxd to initialize: .success

vmware-vpxd: vpxd has initialized.

Last login: Fri May 15 21:18:53 UTC 2015 on console

Stopping VMware vSphere Web Client...

Stopped VMware vSphere Web Client.

Last login: Fri May 15 21:32:20 UTC 2015 on pts/1

Starting VMware vSphere Web Client...

Waiting for VMware vSphere Web Client......

running: PID:30348

[Screenshot: 2015-05-16 00_48_38-vSphere Web Client.png]

As a result, there are no buttons.

As far as I understand, there is some problem with the certificate.

So please help me troubleshoot this.

Thank you.

jonretting
Enthusiast

Running into the same issue here with VCSA. I was able to replicate this problem on two fresh vSphere 6.0 clusters. Looking into "vsanhealth_firstboot.py" and "firstboot.py", I noticed some notes and some commented-out code related to the problem. It looks like these scripts are still a work in progress.

Some snippets of the previously mentioned python scripts:

-- vsanhealth_firstboot.py -- LINE: 292

         res = vsanhealth_fb.get_rp_cert_info()
         print str(res)
         # XXX: Generating certs doesn't work when invoked after the initial boot
         #res = vsanhealth_fb.generate_certs()
         #print str(res)

So certificate generation never takes place: the script only calls "get_rp_cert_info()", and the call to "generate_certs()" is commented out. The function "generate_certs()" seems to have other issues as well, including password generation (the password is hard-coded to "foo"), and other parts that need debugging. Hopefully this gets shored up in the next update/patch/bugfix.
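To make the first traceback easier to read, here is a minimal sketch of the failure pattern it shows: an error handler that calls a helper with one argument too many, so a TypeError replaces the real file-read error. Everything below is a hypothetical stand-in written for illustration (localized_string, read_file_buggy, read_file_fixed, failure_type); it is not VMware's code, and the "fixed" variant is only one possible way to pass the arguments. The second traceback (UnboundLocalError on eInfo) looks like the same kind of problem one level up, in the handler inside Main.

```python
# Illustrative stand-ins only; none of these are VMware's actual functions.


def localized_string(msg, arg=None):
    """Hypothetical helper that accepts at most two arguments."""
    return msg if arg is None else msg % arg


def read_file_buggy(file_name):
    try:
        with open(file_name) as handle:
            return handle.read()
    except IOError as e:
        # Three arguments go to a two-argument helper, so this line raises
        # TypeError and the original IOError never reaches the caller.
        raise RuntimeError(localized_string("Cannot read %s", file_name, e))


def read_file_fixed(file_name):
    try:
        with open(file_name) as handle:
            return handle.read()
    except IOError as e:
        # Pack both values into one tuple so the helper sees two arguments.
        raise RuntimeError(
            localized_string("Cannot read %s: %s", (file_name, e)))


def failure_type(reader, path):
    """Return the name of the exception a reader raises, or None."""
    try:
        reader(path)
    except Exception as exc:
        return type(exc).__name__
    return None
```

With an empty path, which is what an unset rhttpproxy.cert would hand to the file reader, failure_type(read_file_buggy, "") reports TypeError, while failure_type(read_file_fixed, "") reports the RuntimeError that carries the real cause.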

Here are the mentioned functions from "firstboot.py".

<code>

   def get_rp_cert_info(self):
      rp_cert_file = wait_for_install_parameter('rhttpproxy.cert')
      thumbprint, ssl_trust, crt = get_certinfo(rp_cert_file)
      self._rp_crt_info = {
         'cert_file' : rp_cert_file,
         'thumbprint' : thumbprint,
         'ssl_trust' : ssl_trust,
         'crt' : crt
      }

   # XXX TODO: Delete the generate_certs function after all firstboot scripts
   # switch to use certs and solution user generated in soluser_firstboot.py
   def generate_certs(self, generate_jks=False, component_name=None):
      #
      # TODO: Currently, the certs are generated in a temp location, need to
      # modify this code to directly create certs in the location provided
      # by the component for storing the certs
      #
      if component_name is None:
         component_name = self._component_name
      vmca = CerTool()
      vmca.GenCert(component_name)
      cert_info = {}
      cert_info['cert_file'] = vmca.GetCertFileName()
      cert_info['private_key_file'] = vmca.GetPrivateKeyFileName()
      cert_info['public_key_file'] = vmca.GetPublicKeyFileName()
      cert_info['pfx_file'] = vmca.GetPfxFileName()
      cert_info['password'] = "foo" #vmca.GetPassword()
      create_dir(self.get_ssl_path())
      copyfile(cert_info['cert_file'], self.get_public_crt())
      copyfile(cert_info['private_key_file'], self.get_private_key())
      copyfile(cert_info['pfx_file'], self.get_pfx_file())
      if generate_jks:
          log('Creating JKS keystore ...')
          # If -deststorepass and -srcstorepass arguments are not specified
          # while invoking keytool, keytool will prompt for the destination
          # keystore password twice and the source keystore password once:
          # Enter destination keystore password:
          # Re-enter new password:
          # Enter source keystore password:
          # Since, we reuse the src keystore password as the destination
          # keystore password, we repeat it thrice in stdin. We do not
          # specify the passwords in the command line for security reasons
          # as well as the fact that keytool does not like passwords that start
          # with "-J"
          pwd_stdin = 3 * ('%s\n' % cert_info['password'])
          try:
             invoke_command([get_keytool(),
                            '-importkeystore',
                            '-destkeystore', self.get_jks_file(),
                            '-srckeystore', self.get_pfx_file(),
                            '-srcstoretype', 'PKCS12',
                            '-alias', self._constants['key_alias']],\
                            pwd_stdin)
          except InvokeCommandException as e:
             err = _T('install.ciscommon.firstboot.create.jkskeystore',
                      'ERROR: Failed to create JKS Keystore.')
             err_lmsg = localizedString(err)
             e.appendErrorStack(err_lmsg)
             raise e
          self.import_rp_cert_in_jks(self.get_jks_file(), cert_info['password'])
      return cert_info

</code>

Cheers

philzy
Enthusiast

OK, thank you.

But as far as I can see from these code snippets, there is no workaround so far.

So I'm going to wait for the next release of the plug-in.

rbolgerTrace3
Enthusiast

I was having the same issue with my installation. The root cause of the problem seems to be the warning about rhttpproxy.cert being empty. I noticed that in other people's installations this parameter was returning /etc/vmware-rhttpproxy/ssl/rui.crt, which is basically the path to the certificate that the vCenter web client serves. That file existed on my installation and I verified the cert details with openssl. So I was left wondering why rhttpproxy.cert was being read as empty. After extensive googling, I stumbled across one of William Lam's blog posts (vCenter Server 6.0 Tidbits Part 1: What install & deployment parameters did I use? | virtuallyGhetto) mentioning the /bin/install-parameter utility.

And indeed, running "/bin/install-parameter rhttpproxy.cert" returned an empty value on my system. So I took a look at the (Python) source for that utility, and it appeared to have an optional argument called --setdefault which would supposedly let you set a default value for the parameter. So I ran "/bin/install-parameter rhttpproxy.cert -s /etc/vmware-rhttpproxy/ssl/rui.crt", which appears to have worked. Now when I run the original command to query the value, it returns the default path.
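For anyone who wants to script the check above, here is a rough Python sketch of the same two commands: query the parameter, and set a default only if the query comes back empty. The two command lines are the ones quoted above; the function names (build_query_cmd, build_set_default_cmd, ensure_parameter) are made up for this example, and it assumes the utility prints the bare value on stdout, which I have not confirmed beyond what is described here. The run argument exists only so the logic can be tried without touching a real VCSA.

```python
import subprocess

INSTALL_PARAMETER = "/bin/install-parameter"


def build_query_cmd(name):
    """Command line that prints the current value of a parameter."""
    return [INSTALL_PARAMETER, name]


def build_set_default_cmd(name, value):
    """Command line that sets a default, using the -s form quoted above."""
    return [INSTALL_PARAMETER, name, "-s", value]


def ensure_parameter(name, default, run=subprocess.check_output):
    """Return the parameter value, setting `default` first if it is empty.

    Assumption: the utility writes the bare value to stdout.
    """
    current = run(build_query_cmd(name)).decode().strip()
    if current:
        return current
    run(build_set_default_cmd(name, default))
    # Query again so the caller sees what the utility now reports.
    return run(build_query_cmd(name)).decode().strip()
```

On the appliance, as root, the call would be ensure_parameter("rhttpproxy.cert", "/etc/vmware-rhttpproxy/ssl/rui.crt"). A parameter that already has a value is left alone.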

And finally, I tried re-running health-rpm-post-install.sh and it claims to have worked. But unfortunately, I'm still not quite there. The "Enable" button is still missing from the web client health service page.

jonretting
Enthusiast

I just tested your solution out on one heavily used VCSA and a vanilla one, and it was a full success. Much appreciated!

Thanks,

-Jon

rbolgerTrace3
Enthusiast

Glad it worked! I managed to solve the rest of my problem and get the plugin loaded as well. Setting rhttpproxy.cert fixed the problem with health-rpm-post-install.sh, which now finishes successfully. But after starting the vmware-vsan-health service, the health page in the web client still never loaded buttons like "Enable".

I checked /var/log/vmware/vsan-health/vmware-vsan-health-service.log and noticed it was logging "Failed to log into VC, retrying in 10 seconds" over and over. So I went digging through the Python source in /usr/lib/vmware-vpx/vsan-health. I figured out that while starting up the web service that hosts the plugin, it tries to connect to vCenter using vCenter's own SSL cert and private key (rui.crt and rui.key) in /etc/vmware-vpx/ssl. On my VCSA, the permissions in that folder looked like this:

myvcsa:/etc/vmware-vpx/ssl # ls -la
total 28
drwxr-x---  2 root cis  4096 Jul 20 05:00 .
drwxr-xr-x 14 root root 4096 Jul 21 04:05 ..
-rw-------  1 root root 3416 Apr 30 05:36 rui.crt
-rw-------  1 root root 1704 Apr 30 05:36 rui.key
-rw-------  1 root root   65 Apr 30 05:19 symkey.dat
-rw-------  1 root root 3343 Apr 30 05:36 vcsoluser.crt
-rw-------  1 root root 1704 Apr 30 05:36 vcsoluser.key

Now, I knew that the health service was running as a local user called vsan-health, so there's no way it would be able to read those files. Luckily, I had a mostly vanilla VCSA that I could compare it with. Here's what the vanilla VCSA folder looked like:

myvcsa:/etc/vmware-vpx/ssl # ls -la
total 28
drwxr-x---  2 root cis  4096 Jul 20 05:00 .
drwxr-xr-x 14 root root 4096 Jul 21 04:24 ..
-rw-r-----  1 root cis  3416 Apr 30 05:36 rui.crt
-rw-r-----  1 root cis  1704 Apr 30 05:36 rui.key
-rw-------  1 root root   65 Apr 30 05:19 symkey.dat
-rw-r-----  1 root cis  3343 Apr 30 05:36 vcsoluser.crt
-rw-r-----  1 root cis  1704 Apr 30 05:36 vcsoluser.key

Notice the group ownership difference on the cert-related files (cis instead of root) and the change from 600 to 640 permissions. When I saw this, I also remembered seeing in the vsan firstboot script that the vsan-health user was being added to the cis group. As soon as I made my broken VCSA's permissions match the vanilla one, the service started up and everything started working. I'm guessing the reason my permissions were out of whack is a bug in the SSL replacement scripts. One of the first things I do on my vCenter is replace the SSL certs with custom ones from our PKI. I'm guessing that process is currently not working quite right and screws up the permissions on the files that get replaced.
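If you would rather script the permission fix than do it by hand, something along these lines should work. It is only a sketch: the file names, the 640 mode and the cis group come from the two listings above, while the function name fix_cert_permissions and its arguments are invented for this example. It leaves symkey.dat alone, since that file is root-only on the vanilla VCSA too.

```python
import os
import stat

# The four files whose group and mode differ between the two listings.
CERT_FILES = ("rui.crt", "rui.key", "vcsoluser.crt", "vcsoluser.key")


def fix_cert_permissions(ssl_dir, gid, names=CERT_FILES, mode=0o640):
    """Give each listed file the target group and mode.

    Returns the names that were actually changed; files that are missing
    or already correct are skipped.
    """
    changed = []
    for name in names:
        path = os.path.join(ssl_dir, name)
        if not os.path.isfile(path):
            continue
        info = os.stat(path)
        if info.st_gid == gid and stat.S_IMODE(info.st_mode) == mode:
            continue
        os.chown(path, -1, gid)  # -1 keeps the current owner (root)
        os.chmod(path, mode)
        changed.append(name)
    return changed
```

Run as root on the appliance, the call would be fix_cert_permissions("/etc/vmware-vpx/ssl", grp.getgrnam("cis").gr_gid) after an import grp. Restart the vmware-vsan-health service afterwards so it retries the login.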

jonretting
Enthusiast

That's a great find, well done, and hopefully the plug-in developers take a look at this. I took note of your fix for the permission difference, as I am sure it will spring up for me in the same fashion. Thanks, -Jon

Bleeder
Hot Shot

I noticed that a new version of the VSAN Health plugin was released yesterday. 

VMware Virtual SAN Health Check Plug-in 6.0.1 Release Notes

OITVIRT
Contributor

We had the exact same problem with the VCSA.  I had to add additional read permissions to rui.crt and rui.key, then the button showed up and everything worked.

Good luck!
Jill

philzy
Enthusiast

To
