Enthusiast

vCenter 6.5 - vcenter appliance stops working out of the blue, AGAIN!!

This happened to me a few months ago on a fresh install of the vCenter appliance 6.5. It just stopped working a week or two after I applied an update. Services would not start and there was no indication as to why. It wasn't a disk space issue, and it wasn't the issue I had read about with a duplicate value in the vPostgres database. I finally gave up, wiped it out, and redeployed from scratch.

Well, lo and behold, sometime last night vCenter stopped working again. This time it was less than a full week after I applied the 6.5.0c patch. Only two services start; none of the others will. My deployment is two appliances, a PSC and the vCenter. The PSC appears fine and its services show as healthy. The vCenter turned to garbage again. Here's the output of service-control:

root@mp1vsivcs501 [ ~ ]# service-control --status
Running:
 lwsmd vmafdd
Stopped:
 applmgmt vmcam vmonapi vmware-cm vmware-content-library vmware-eam vmware-imagebuilder vmware-mbcs vmware-netdumper vmware-perfcharts vmware-rbd-watchdog vmware-rhttpproxy vmware-sca vmware-sps vmware-statsmonitor vmware-updatemgr vmware-vapi-endpoint vmware-vcha vmware-vmon vmware-vpostgres vmware-vpxd vmware-vpxd-svcs vmware-vsan-health vmware-vsm vsphere-client vsphere-ui

Trying to start any service produces a similar output:

root@mp1vsivcs501 [ ~ ]# service-control --start vmware-vpxd-svcs
Perform start operation. vmon_profile=None, svc_names=['vmware-vpxd-svcs'], include_coreossvcs=False, include_leafossvcs=False
2017-04-24T19:36:49.136Z   Running command: ['/usr/bin/systemctl', 'set-environment', 'VMON_PROFILE=NONE']
2017-04-24T19:36:49.140Z   Done running command
2017-04-24T19:36:49.143Z   Running command: ['/usr/bin/systemctl', 'daemon-reload']
2017-04-24T19:36:49.222Z   Done running command
2017-04-24T19:36:49.222Z   Running command: ['/usr/bin/systemctl', 'set-property', u'vmware-vmon.service', 'MemoryAccounting=true', 'CPUAccounting=true', 'BlockIOAccounting=true']
2017-04-24T19:36:49.227Z   Done running command
2017-04-24T19:36:49.231Z   RC = 1
Stdout =
Stderr = Failed to execute operation: Unit file is masked
2017-04-24T19:36:49.231Z   {
    "resolution": null,
    "detail": [
        {
            "args": [
                "Stderr: Failed to execute operation: Unit file is masked\n"
            ],
            "id": "install.ciscommon.command.errinvoke",
            "localized": "An error occurred while invoking external command : 'Stderr: Failed to execute operation: Unit file is masked\n'",
            "translatable": "An error occurred while invoking external command : '%(0)s'"
        }
    ],
    "componentKey": null,
    "problemId": null
}
2017-04-24T19:36:49.231Z   Running command: ['/usr/bin/systemctl', 'unset-environment', 'VMON_PROFILE']
2017-04-24T19:36:49.235Z   Done running command
Error executing start on service vpxd-svcs. Details {
    "resolution": null,
    "detail": [
        {
            "args": [
                "vmware-vmon"
            ],
            "id": "install.ciscommon.service.failstart",
            "localized": "An error occurred while starting service 'vmware-vmon'",
            "translatable": "An error occurred while starting service '%(0)s'"
        }
    ],
    "componentKey": null,
    "problemId": null
}
Service-control failed. Error {
    "resolution": null,
    "detail": [
        {
            "args": [
                "vmware-vmon"
            ],
            "id": "install.ciscommon.service.failstart",
            "localized": "An error occurred while starting service 'vmware-vmon'",
            "translatable": "An error occurred while starting service '%(0)s'"
        }
    ],
    "componentKey": null,
    "problemId": null
}

The first thing that pops out for me is line 11 of the output, "Failed to execute operation: Unit file is masked". I get that for every service I attempt to start, and I'm not finding anything in VMware's knowledge base about it. This is extremely frustrating.

**Additional info**

Searching for just "unit file is masked" took me to a generic Ubuntu thread about systemctl showing masked unit files. Here's the output of systemctl list-unit-files:

root@mp1vsivcs501 [ ~ ]# systemctl list-unit-files | grep vmware
vmware-bigsister.service               static
vmware-cm.service                      masked
vmware-content-library.service         masked
vmware-eam.service                     masked
vmware-firewall.service                enabled
vmware-imagebuilder.service            masked
vmware-mbcs.service                    masked
vmware-netdump.service                 masked
vmware-perfcharts.service              masked
vmware-rbd-watchdog.service            masked
vmware-rhttpproxy.service              masked
vmware-sca.service                     masked
vmware-sps.service                     masked
vmware-statsmonitor.service            masked
vmware-updatemgr.service               masked
vmware-vapi.service                    masked
vmware-vcha.service                    masked
vmware-vmon.service                    masked
vmware-vmonapi.service                 masked
vmware-vpostgres.service               masked
vmware-vpxd-svcs.service               masked
vmware-vpxd.service                    masked
vmware-vsan-health.service             masked
vmware-vsm.service                     masked
vmware-bigsister.timer                 disabled

Not sure if that's normal or not, but it appears to be what the error message is complaining about?
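For anyone comparing listings like the one above, a small filter can pull out just the masked unit names. This is only a sketch: masked_units is a made-up helper name, and it assumes the two-column "unit state" layout that systemctl list-unit-files printed here.

```shell
#!/bin/sh
# masked_units: read `systemctl list-unit-files` text on stdin and print
# the names of units whose state column is exactly "masked".
# (Hypothetical helper; assumes the two-column layout shown in this thread.)
masked_units() {
    awk '$2 == "masked" { print $1 }'
}

# Typical use on the appliance (not executed here):
#   systemctl list-unit-files | masked_units
```

Matching on the second column rather than grepping for the word "masked" keeps header and footer lines out of the result.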

Message was edited by: jhboricua

12 Replies
Enthusiast

I ran into the same issue as part of a very long (20+ hour) P1 call about my vpxd service crashing whenever a VM gets assigned an invalid VDS network port group. Unfortunately, the only resolution was to restore from backup or redeploy.

Enthusiast

There's gotta be something else to this. I'm not running a VDS in my setup. It's all standard vSwitches.

Enthusiast

We had this issue last week. After a 3-hr support call with a vCenter support engineer, he came up with the idea of looking around the forums. The fix is to unmask the vmon service, which appears as vmware-vmon.service in the list-unit-files output above:

systemctl unmask vmware-vmon.service

Then reboot your appliance. This fixes the issue.

We still do not know why the vmon service got masked to begin with. Maybe it is some kind of race condition during shutdown, since the appliance startup and shutdown scripts do a lot of systemctl masking and unmasking.
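A quick way to confirm whether a single unit is masked, before and after the unmask, is to test whether its entry in /etc/systemd/system is a symlink to /dev/null. The sketch below assumes that layout, which matches the ls output posted further down this thread. unit_is_masked is a hypothetical helper name, and it only looks at persistent masks in the given directory, not at runtime masks under /run.

```shell
#!/bin/sh
# unit_is_masked UNIT [DIR]
# Succeeds (exit 0) when DIR/UNIT is a symlink pointing at /dev/null.
# DIR defaults to /etc/systemd/system. Hypothetical helper for illustration.
unit_is_masked() {
    unit=$1
    dir=${2:-/etc/systemd/system}
    [ -L "$dir/$unit" ] && [ "$(readlink "$dir/$unit")" = "/dev/null" ]
}

# Example on the appliance (not executed here):
#   if unit_is_masked vmware-vmon.service; then echo "still masked"; fi
```

On a live system, `systemctl is-enabled vmware-vmon.service` reports the same state by printing "masked".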

Contributor

Hi,

Warning 1. The following is a Linux solution to the problem and does not take into account any of the configuration or the reasons for the masking.

Warning 2. The ongoing failures seem to be caused by the system boot / shutdown process, so external issues may still be in play. Be careful; I suggest this only for lab testing.

Log in as root, then enter

shell <cr>

cd /etc/systemd/system

<Please note this may seem to be just a directory, BUT systemd reads its unit configuration from here, so changes directly affect which services can start>

ls -lisa

<here are the units I found masked>

root@localhost [ ~ ]# systemctl list-unit-files | grep masked
applmgmt.service                       masked
vmcam.service                          masked
vmware-cis-license.service             masked
vmware-cm.service                      masked
vmware-content-library.service         masked
vmware-eam.service                     masked
vmware-imagebuilder.service            masked
vmware-mbcs.service                    masked
vmware-netdump.service                 masked
vmware-perfcharts.service              masked
vmware-pschealth.service               masked
vmware-rbd-watchdog.service            masked
vmware-rhttpproxy.service              masked
vmware-sca.service                     masked
vmware-sps.service                     masked
vmware-statsmonitor.service            masked
vmware-updatemgr.service               masked
vmware-vapi.service                    masked
vmware-vcha.service                    masked
vmware-vmonapi.service                 masked
vmware-vpostgres.service               masked
vmware-vpxd-svcs.service               masked
vmware-vpxd.service                    masked
vmware-vsan-health.service             masked
vmware-vsm.service                     masked
vsphere-client.service                 masked
vsphere-ui.service                     masked
ctrl-alt-del.target                    masked

then look in the directory

root@localhost [ /etc/systemd/system ]# ls -lisa
total 108
451046 4 drwxr-xr-x 24 root root 4096 May  9 03:03 .
450562 4 drwxr-xr-x  7 root root 4096 May  9 01:35 ..
452681 0 lrwxrwxrwx  1 root root    9 May  8 08:11 applmgmt.service -> /dev/null
467464 4 drwxr-xr-x  2 root root 4096 May  8 08:19 applmgmt.service.d
451876 0 lrwxrwxrwx  1 root root   40 Oct 22  2016 default.target -> /usr/lib/systemd/system/runlevel3.target
451048 4 drwxr-xr-x  2 root root 4096 May  8 17:08 getty.target.wants
467484 4 drwxr-xr-x  2 root root 4096 May  8 08:19 halt.target.wants
451195 4 drwxr-xr-x  2 root root 4096 Oct 22  2016 local-fs.target.wants
467460 4 drwxr-xr-x  2 root root 4096 May  8 08:19 lwsmd.service.d
451050 4 drwxr-xr-x  2 root root 4096 May  8 09:03 multi-user.target.wants
451054 4 drwxr-xr-x  2 root root 4096 Oct 22  2016 network-online.target.wants
467486 4 drwxr-xr-x  2 root root 4096 May  8 08:19 poweroff.target.wants
467482 4 drwxr-xr-x  2 root root 4096 May  8 08:19 reboot.target.wants
451834 4 -rw-r--r--  1 root root  268 Jun  7  2016 sendmail.service
467117 4 drwxr-xr-x  2 root root 4096 May  8 08:19 shutdown.target.wants
452083 4 -rw-r--r--  1 root root  476 Aug 22  2016 snmpd.service
451056 4 drwxr-xr-x  2 root root 4096 Oct 22  2016 sockets.target.wants
451058 4 drwxr-xr-x  2 root root 4096 Oct 22  2016 sysinit.target.wants
451107 0 lrwxrwxrwx  1 root root   39 Oct 22  2016 syslog.service -> /usr/lib/systemd/system/rsyslog.service
452464 4 -r-xr-xr-x  1 root root  470 Jan 18 10:08 vcha-hacheck.service
452104 4 drwxr-xr-x  2 root root 4096 May  8 08:19 vmafdd.service.d
452121 4 drwxr-xr-x  2 root root 4096 May  8 08:19 vmcad.service.d
452752 0 lrwxrwxrwx  1 root root    9 May  8 08:17 vmcam.service -> /dev/null
467023 4 drwxr-xr-x  2 root root 4096 May  8 17:10 vmcam.service.d
452116 4 drwxr-xr-x  2 root root 4096 May  8 08:19 vmdird.service.d
452157 4 drwxr-xr-x  2 root root 4096 May  8 08:19 vmdnsd.service.d
451129 4 drwxr-xr-x  2 root root 4096 Oct 22  2016 vmtoolsd.service.requires
452654 0 lrwxrwxrwx  1 root root    9 May  8 08:10 vmware-cis-license.service -> /dev/null
452651 0 lrwxrwxrwx  1 root root    9 May  8 08:09 vmware-cm.service -> /dev/null
452726 0 lrwxrwxrwx  1 root root    9 May  8 08:14 vmware-content-library.service -> /dev/null
452734 0 lrwxrwxrwx  1 root root    9 May  8 08:16 vmware-eam.service -> /dev/null
452761 0 lrwxrwxrwx  1 root root    9 May  8 08:18 vmware-imagebuilder.service -> /dev/null
452707 0 lrwxrwxrwx  1 root root    9 May  8 08:12 vmware-mbcs.service -> /dev/null
452684 0 lrwxrwxrwx  1 root root    9 May  8 08:11 vmware-netdump.service -> /dev/null
452763 0 lrwxrwxrwx  1 root root    9 May  8 08:18 vmware-perfcharts.service -> /dev/null
467114 4 drwxr-xr-x  2 root root 4096 May  8 08:19 vmware-psc-client.service.d
452247 0 lrwxrwxrwx  1 root root    9 May  8 08:11 vmware-pschealth.service -> /dev/null
452745 0 lrwxrwxrwx  1 root root    9 May  8 08:16 vmware-rbd-watchdog.service -> /dev/null
452646 0 lrwxrwxrwx  1 root root    9 May  8 08:09 vmware-rhttpproxy.service -> /dev/null
452664 0 lrwxrwxrwx  1 root root    9 May  8 08:10 vmware-sca.service -> /dev/null
452435 0 lrwxrwxrwx  1 root root    9 May  8 08:16 vmware-sps.service -> /dev/null
452690 0 lrwxrwxrwx  1 root root    9 May  8 08:11 vmware-statsmonitor.service -> /dev/null
467472 4 drwxr-xr-x  2 root root 4096 May  8 08:19 vmware-stsd.service.d
467476 4 drwxr-xr-x  2 root root 4096 May  8 08:19 vmware-sts-idmd.service.d
452749 0 lrwxrwxrwx  1 root root    9 May  8 08:17 vmware-updatemgr.service -> /dev/null
452667 0 lrwxrwxrwx  1 root root    9 May  8 08:10 vmware-vapi.service -> /dev/null
452751 0 lrwxrwxrwx  1 root root    9 May  8 08:17 vmware-vcha.service -> /dev/null
467468 4 drwxr-xr-x  2 root root 4096 May  8 08:19 vmware-vmon.service.d
452692 0 lrwxrwxrwx  1 root root    9 May  8 08:11 vmware-vpostgres.service -> /dev/null
452718 0 lrwxrwxrwx  1 root root    9 May  8 08:12 vmware-vpxd.service -> /dev/null
452704 0 lrwxrwxrwx  1 root root    9 May  8 08:11 vmware-vpxd-svcs.service -> /dev/null
452483 0 lrwxrwxrwx  1 root root    9 May  8 17:11 vmware-vsan-health.service -> /dev/null
452758 0 lrwxrwxrwx  1 root root    9 May  8 08:18 vmware-vsm.service -> /dev/null

<This displays all the files and, more importantly, the links in the directory. We need to remove all the links to /dev/null>

<I removed them all, but it may be that only some of them should be removed. Remember that this directory controls what systemd is allowed to start>

<and there are usually good reasons to stop root from doing things. Masking is a protective mechanism: a masked unit is linked to /dev/null so that nothing can start it>

<this command removes all the vmware* links and skips the directories>

rm vmware*

<I also removed the other files that were linked to /dev/null>

then reboot

reboot <cr>

Hope that helps. I am not a VMware specialist; my knowledge is Linux, and this is a systems solution that may not fix an underlying issue.

regards

Jeremy
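Given the warnings above that the masks may be there for a reason, it can help to record which units were masked before removing any links, so the state can be put back. The following is a minimal sketch under that assumption; save_mask_list and restore_masks are hypothetical helper names, and the directory and list file are passed in explicitly so nothing is touched by default. It relies on GNU find (-lname, -maxdepth).

```shell
#!/bin/sh
# save_mask_list DIR LISTFILE
# Write the names of all mask symlinks (links to /dev/null) found directly
# in DIR to LISTFILE, one per line, sorted.
save_mask_list() {
    dir=$1
    list=$2
    find "$dir" -maxdepth 1 -type l -lname /dev/null -exec basename {} \; | sort > "$list"
}

# restore_masks DIR LISTFILE
# Re-create a /dev/null symlink in DIR for every name in LISTFILE.
# Names that already exist in DIR are left untouched.
restore_masks() {
    dir=$1
    list=$2
    while IFS= read -r name; do
        [ -n "$name" ] || continue
        if [ -e "$dir/$name" ] || [ -L "$dir/$name" ]; then
            continue
        fi
        ln -s /dev/null "$dir/$name"
    done < "$list"
}

# Example on the appliance (not executed here):
#   save_mask_list /etc/systemd/system /root/masked-units.txt
```

After a restore, `systemctl daemon-reload` would be needed for systemd to notice the change.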

Contributor

Still a bug in the December patch. Here are some easier steps to unmask all services.

# List all masked services for removal.

find /etc/systemd/system/ -lname '/dev/null' -exec ls {} \;

# Automatically remove them (or rm each file)

find /etc/systemd/system/ -lname '/dev/null' -exec rm {} \;

# Reload the systemd daemon

systemctl daemon-reload

# Start services, or reboot

service-control --start --all
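For those who want a preview before deleting anything, the list and remove steps can be wrapped in one function with a dry-run switch. This is a sketch, not a supported tool: unmask_dir and the DRY_RUN variable are made-up names, and unlike the find commands above it only looks at links directly inside the directory (-maxdepth 1), which is an assumption about where the masks live.

```shell
#!/bin/sh
# unmask_dir [DIR]
# Delete mask symlinks (links to /dev/null) located directly inside DIR.
# DIR defaults to /etc/systemd/system.
# With DRY_RUN=1 in the environment, only print what would be removed.
unmask_dir() {
    dir=${1:-/etc/systemd/system}
    find "$dir" -maxdepth 1 -type l -lname /dev/null | sort |
    while IFS= read -r link; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "would remove $link"
        else
            rm -- "$link" && echo "removed $link"
        fi
    done
}

# Example on the appliance (not executed here):
#   (DRY_RUN=1 unmask_dir)    # preview
#   unmask_dir && systemctl daemon-reload && service-control --start --all
```

Directories such as vmware-vmon.service.d and real unit files are never matched, because only symlinks whose target is exactly /dev/null are selected.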

Contributor

Those 4 lines did the trick for me, JacobDEvans. Thanks for the post.

Enthusiast

Hi all,

This issue isn't fixed in the latest VCSA 6.5U1g update.

Information from VMware Support:

This issue is tracked and will be fixed in vCenter 6.5 Update 2.

For now the workaround described by JacobDEvans is supported.

Thanks and BR/JO!

Johannes Strasser / SDDC Architect @ Porsche Informatik GmbH
Twitter: @jo_strasser
Contributor

Just wanted to update on the status of this. I'm still having the same problem with 6.5 U2b. The solution still works.

Enthusiast

Thanks for the solution. It was driving me crazy. I would still like to know the cause.

Contributor

So what caused it for me was snapshotting the vCenter server while it was on. After restoring from that snapshot, it had this issue.

Enthusiast

This happened to my VCSA 6.7 appliance after I cloned it to migrate to a new cluster, and started it up. So thankful to find this post!
