VMware Cloud Community
5mall5nail5
Enthusiast

vRA buggered bad - root / full, services no status, error 404s

Hey all -

Went into our non-load balanced vRA setup to find some funkiness in the UI about Model Manager within the Infrastructure tab.  I rebooted the IaaS VM and the error persisted.  So, I rebooted the vRA appliance.

Coming back up, Elasticsearch reported that it could not start, and other services either took a long time or did not start at all.  Once it was up, I logged in as root and checked disk space with df -h to find that / was full.  I added a disk to the VM, took a snapshot, and moved some log files - /var/log was 35G.  I rebooted, figuring everything would be fine, but it's still not happy.
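For anyone landing here with the same full / problem: a quick way to see which files are actually eating the space is to walk the directory and list the biggest ones. The snippet below is only a sketch, assuming Python 3 is available wherever you run it; the function name largest_files is made up for illustration, and /var/log is just the directory mentioned above.

```python
import heapq
import os


def largest_files(root, top_n=10):
    """Return up to top_n (size_bytes, path) tuples, biggest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat counts a symlink as the link itself, not its target
                sizes.append((os.lstat(path).st_size, path))
            except OSError:
                # File vanished or is unreadable; skip it
                continue
    return heapq.nlargest(top_n, sizes)


if __name__ == "__main__":
    for size, path in largest_files("/var/log"):
        print(f"{size / 1024**2:10.1f} MiB  {path}")
```

Sorting by size rather than by date tends to surface a single runaway log quickly, which is usually what you want to know before deciding what to move or compress.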

I spent a couple hours searching logs for blatant errors but everything is pretty generic.  When I try to log into https://fqdn-vra/ I get the "VMware vRealize Automation Appliance" page.  So that works.  When I click the "vRealize Automation console" link I get a page with a heading "Identity Manager" and "The page you were looking for is not available. You may need to contact your administrator with this error: 404 Page Not Found."

So, clearly vIDM or whatever is not running.  The services list at https://fqdn-vra:5480 is full of FAILED entries and services with no status.


Failed:

  • branding-service
  • console-proxy-service
  • container-service
  • content-management
  • forms-service
  • healthbroker-proxy-server
  • management-service
  • portal-service
  • properties-service
  • release-management
  • reservation-service
  • shell-ui-app

Registered:

  • component-registry
  • licensing-service
  • plugin-service
  • pricing-api
  • vcbm-service

Everything else has no status.

At this point / has 10% free, so I don't think it's a disk space issue any longer.  All of the other mount points in df -h show very low utilization.
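One thing df -h does not show is inode usage, and a filesystem can refuse writes when it runs out of inodes even with free blocks left. The sketch below reports both percentages for a path using os.statvfs (Unix only). The function name fs_free_percent is hypothetical, and the mount points in the usage loop are simply the paths mentioned in this thread, so adjust them to your layout.

```python
import os


def fs_free_percent(path):
    """Return (block_free_pct, inode_free_pct) for the filesystem holding path."""
    st = os.statvfs(path)
    # f_bavail: blocks available to unprivileged users; f_blocks: total blocks
    blocks = 100.0 * st.f_bavail / st.f_blocks if st.f_blocks else 100.0
    # f_favail / f_files: the same idea for inodes (some filesystems report 0 total)
    inodes = 100.0 * st.f_favail / st.f_files if st.f_files else 100.0
    return blocks, inodes


if __name__ == "__main__":
    for mount in ("/", "/var/log", "/storage/log"):
        if os.path.exists(mount):
            blocks, inodes = fs_free_percent(mount)
            print(f"{mount}: {blocks:.1f}% blocks free, {inodes:.1f}% inodes free")
```

The block figure can differ slightly from what df prints, because this counts only space available to non-root users.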

In VAMI under vRA Host Settings, the cert is not expired, SSO info says "Configured - working connected", and the DB is connected, but under licensing it shows "Error connecting to server. Is vRA running?"

Maybe someone has seen this before or is good at troubleshooting vRA?  Appreciate any help!

9 Replies
daphnissov
Immortal

First of all, what version is this? Second, I probably wouldn't have added that disk. The appliance should NOT be filling up like that and so creating free space will be necessary. For the time being, you may want to open an SR to have a look at this situation, especially if this is in production.

5mall5nail5
Enthusiast

Thanks Daphnissov - it's 7.3, been running reliably since early last year.

FWIW I didn't actually have to use that disk - I never partitioned it or added it to any LVM, etc.  I zipped some logs to gain space.  I agree that I shouldn't need space like this - I have another environment with similar uptime and nowhere near as many logs.  Many of the large log files were from 2017, so I'm not sure what happened there.  I've opened an SR, hoping we can figure it out.  I would hate to have to reconfigure all of the blueprints, etc.

daphnissov
Immortal

You may be experiencing this issue with the run-away logging of the Health Service:  https://kb.vmware.com/s/article/2151693

5mall5nail5
Enthusiast

Thanks daphnissov - it seems that is not the issue we're facing.

Spent a couple hours on the phone with VMware over my SR.  We made progress for sure, but not out of the woods yet.  Both support and I had a hard stop, so continuing tomorrow most likely.

Where it stands now is almost all services are registered, except for:

  • vco
  • shell-ui-app
  • release-management
  • o11n-gateway-service
  • advanced-designer-service

It appears there was a RabbitMQ issue, which we think has been resolved.  For some reason there is still an issue involving the embedded vCenter Orchestrator.  When I tail /var/log/vmware/vcac/catalina.out I see:

2018-02-22 21:50:16,846 vcac: [component="cafe:o11n-gateway" priority="WARN" thread="tomcat-http--33" tenant="vsphere.local" context="BmP2Ahps" parent="" token="BmP2Ahps"] com.vmware.vcac.o11n.gateway.vco.VcoSessionManager.createNewSession:303 - 85021-Unable to establish a connection to vCenter Orchestrator server.

And variations thereof.  So something is unhappy on the vCO and o11n-gateway front. Hopefully it can be sorted!
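When catalina.out is full of near-identical lines like the one above, it can help to group them by component and message instead of scrolling. The following is a rough sketch, not a complete parser: the regular expression is inferred from the single sample line in this post, so multi-line entries such as stack traces will simply be skipped, and the names parse_line and count_problems are made up for illustration.

```python
import re
from collections import Counter

# Layout assumed from the sample line: timestamp, tag, [key="value" ...], source - message
LINE_RE = re.compile(
    r'^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) '
    r'\S+ \[(?P<fields>[^\]]*)\] '
    r'(?P<source>\S+) - (?P<message>.*)$'
)
FIELD_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    record = dict(FIELD_RE.findall(m.group("fields")))
    record["timestamp"] = m.group("timestamp")
    record["source"] = m.group("source")
    record["message"] = m.group("message")
    return record


def count_problems(lines, priorities=("WARN", "ERROR")):
    """Count matching lines per (component, message) for the given priorities."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec and rec.get("priority") in priorities:
            counts[(rec.get("component"), rec["message"])] += 1
    return counts
```

To use it, open the log file and pass the file object to count_problems, then print counts.most_common(10) to see the most frequent complaints first.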

daphnissov
Immortal

Ok, keep us posted. Would be good to see the full resolution when you have it to assist others.

saurabh1985
Contributor

Hello,

Did you manage to get a resolution for this issue? If so, please share what was done to resolve it, as I am experiencing the same issue.

Thanks

Saurabh Kumar

saurabh1985
Contributor

Could you share the SR number, if possible?

r0j
Enthusiast

Interesting that this KB lists 7.3.1 and 7.4 as resolving this issue. When we upgraded our healthy 7.3 environment to 7.4, precisely the same issue happened: the log and root partitions filled up and crashed the appliance.

If 7.4 resolves this issue, why did excessive logging take out our 7.4 vRA deployment?

We have another post on here regarding that, and a VMware support ticket that was created a few days after the 7.4 release.

Still no resolution, so we are planning a 7.3-to-7.4 migration rather than an upgrade.

r0j
Enthusiast

Have a look at this post, may or may not be related to your issue, and may be helpful to others at some point.

https://communities.vmware.com/thread/586810

In our case, heap dumps were enabled in /etc/init.d/vrhb-service after a 7.3-7.4 upgrade (look for -XX:+HeapDumpOnOutOfMemoryError entries). That created 300MB java_pid.hprof files every 10 minutes in /var/lib/vcac, filling up root / until it blew sky high.
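If you want to check for the same pattern, here is a small sketch that lists .hprof files in a directory by size and shows which lines of a script contain the heap dump flag. It assumes Python 3; the paths in the usage section are the ones from this post, and the helper names find_heap_dumps and lines_with_flag are hypothetical. It only reports what it finds and changes nothing.

```python
import glob
import os


def find_heap_dumps(directory):
    """Return [(path, size_bytes)] for *.hprof files directly inside directory."""
    dumps = []
    for path in glob.glob(os.path.join(directory, "*.hprof")):
        dumps.append((path, os.path.getsize(path)))
    # Largest first
    return sorted(dumps, key=lambda item: item[1], reverse=True)


def lines_with_flag(script_path, flag="-XX:+HeapDumpOnOutOfMemoryError"):
    """Return (line_number, text) for each line of script_path containing flag."""
    hits = []
    with open(script_path, errors="replace") as fh:
        for number, text in enumerate(fh, start=1):
            if flag in text:
                hits.append((number, text.rstrip("\n")))
    return hits


if __name__ == "__main__":
    for path, size in find_heap_dumps("/var/lib/vcac"):
        print(f"{size / 1024**2:8.1f} MiB  {path}")
    script = "/etc/init.d/vrhb-service"
    if os.path.exists(script):
        for number, text in lines_with_flag(script):
            print(f"{script}:{number}: {text.strip()}")
```

Whether it is safe to delete the dumps or remove the flag is a question for your SR; the sketch is only meant to confirm whether you are looking at the same symptom.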

Error 404 and no services up were our symptoms as well.

Also look for massive catalina.log and catalina.out files in /storage/log/vmware/vcac/