VMware Cloud Community
IMrMarkAtVonage
Contributor

VIO2 - Provisioning Error: Failed to Execute Task INNER

Hey guys,

I've seen a couple of threads along this same line on the forum already, but nothing with a real solution that I've been able to figure out. When I go to deploy the initial VIO instance, everything gets cloned successfully and seems to come online. However, once the deployment gets to about 86%, things eventually bomb out and show a full provisioning error across the board, with that specific execution error reported against my two controller node IPs. I've tried to get onto the boxes using the admin account I specified during the deployment configuration, but my login fails on all the deployed boxes... I mention this because I'm not sure if that's somehow a clue. If I browse to the IP address of one of the controllers, I get redirected to the Horizon login page with some of the formatting, but then a "page you were looking for doesn't exist" error.

I've tried different things to get this going, from recreating all the networking to trying completely different deployment schemes in terms of where things will live, etc... each of the 6 deployment attempts I've done has ended up the same way. I can't help but think the admin login thing is part of this, since I would imagine that to be a fundamental need for the whole process... but maybe it uses a random key during setup and only sets the admin account once the deployment is done? Not sure. No LDAP, btw, just a local admin.

Anyway, hopefully someone here has an idea or two I can try. Thanks for taking the time to read this!

jbrowne
VMware Employee

Hi,

Can you send on the following information so we can have a look:

1) A screenshot of the error message from the deployment

2) On the VIO Management server (the vApp that you initially deployed out):

  - /var/log/oms/oms.log (zipping the whole directory would include them all)

  - /var/log/jarvis/ansible.log

This will help troubleshoot the issue.
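
If it's easier, a quick ad-hoc play run on the VIO Management server along these lines would bundle both log directories into a single archive (just a sketch; the /tmp path and archive name are arbitrary):

# Rough sketch - run locally on the VIO Management server; adjust paths as needed.
- hosts: localhost
  connection: local
  tasks:
    - name: bundle the oms and jarvis logs into one archive
      command: tar czf /tmp/vio-logs.tgz /var/log/oms /var/log/jarvis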

Thanks,

John.

IMrMarkAtVonage
Contributor

Hey there - thanks for that. It seems to point the finger solely at the neutron service on the controllers. Here's what seems to be the relevant section, but it doesn't say why it failed:

2016-01-08 17:22:20,852 p=595 u=jarvis |  TASK: [config-controller | start neutron on first controller] *****************
2016-01-08 17:22:21,051 p=595 u=jarvis |  changed: [172.16.211.242]
2016-01-08 17:22:21,052 p=595 u=jarvis |  TASK: [config-controller | wait for neutron to start on first controller for NSX] ***
2016-01-08 17:37:21,560 p=595 u=jarvis |  failed: [172.16.211.242] => {"elapsed": 900, "failed": true}
2016-01-08 17:37:21,560 p=595 u=jarvis |  msg: Timeout when waiting for 127.0.0.1:9696
2016-01-08 17:37:21,560 p=595 u=jarvis |  ...ignoring
2016-01-08 17:37:21,561 p=595 u=jarvis |  TASK: [config-controller | stop neutron if port 9696 is not ready] ************
2016-01-08 17:37:21,825 p=595 u=jarvis |  changed: [172.16.211.242]
2016-01-08 17:37:32,962 p=595 u=jarvis |  ok: [172.16.211.243]
2016-01-08 17:37:32,974 p=595 u=jarvis |  TASK: [config-controller | fail if port 9696 is not ready] ********************
2016-01-08 17:37:33,016 p=595 u=jarvis |  failed: [172.16.211.243] => {"failed": true}

jbrowne
VMware Employee

What is happening at "TASK: [config-controller | wait for neutron to start on first controller for NSX] ***"

is that the NSX Edge devices are being deployed out and the task cannot complete within the allotted time of 900 seconds (15 mins).

You should see these tasks in vCenter (deploying and configuring the VMs; they will be called backup-xxxxxxx).

Is your storage not able to process these VM deployments quickly enough? What is the storage?

On the VIO Management server we can increase the timeout value and wait longer for the NSX Edges to be deployed out.

Modify the following file:

/var/lib/vio/ansible/roles/config-controller/tasks/neutron.yml

Find the section:

- name: wait for neutron to start on first controller for NSX

and increase the timeout value.
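
For reference, the task in that file should look roughly like the sketch below (the exact parameters may differ between VIO versions); only the timeout needs to change, e.g. doubling the default 900 seconds to 1800:

# Sketch of what the existing task looks like - match it against your own
# neutron.yml rather than pasting this in; 1800 is just an example value.
- name: wait for neutron to start on first controller for NSX
  wait_for:
    host: 127.0.0.1
    port: 9696
    timeout: 1800   # default 900 (15 mins) was timing out
  ignore_errors: yes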

Run the "Deploy OpenStack" again.

IMrMarkAtVonage
Contributor

Interesting! I actually don't see any other tasks at all, and the system hasn't even tried to deploy any additional NSX components. I'm running this on a mid-tier Nutanix cluster, so the storage is "okay". Let me poke around a little and update again if I come across anything obvious on the NSX side. Thanks a lot so far - will report back shortly.
IMrMarkAtVonage
Contributor

Couldn't come up with anything - it doesn't look like NSX is getting any tasks to deploy ESGs, though I do see that VIO did create the resource pool. It just never got any further. Any other ideas on where I could look to identify the issue? Or any dependencies on Neutron being able to call NSX that I might not be considering?
jbrowne
VMware Employee

From the NSX side, check that:

1. NSX Manager is running

2. NSX Controller(s) are running (showing Normal in Networking & Security -> Installation -> Management)

3. Hosts are prepared (showing Ready in Networking & Security -> Installation -> Host Preparation)

4. Logical network preparation is done (VXLAN transport configured, segment IDs defined, transport zone created)
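
It's also worth a quick sanity check that the controller nodes can actually reach the NSX Manager API on 443, since neutron has to call it to deploy the edges. An ad-hoc play along these lines would confirm it (the "controllers" group and the NSX Manager address below are placeholders - substitute your own):

# Sketch only - replace the host group and NSX Manager address with your own.
- hosts: controllers
  tasks:
    - name: check that the controller can reach NSX Manager on 443
      wait_for:
        host: nsx-manager.example.com   # your NSX Manager IP/FQDN here
        port: 443
        timeout: 10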

IMrMarkAtVonage
Contributor

Yeah - all solid there, and everything else in the NSX stack seems to be running without issue (other ESGs, DLRs, and some VMs provisioned earlier for something else). It's almost like the bootstrap just isn't issuing the command to stand up anything and just times out, but I'm not sure what would cause that behavior at all. Are there any other VIO logs that might be useful? 
admin
Immortal

Sign up for VIO office hours and I will try to set up resources for WebEx-based live debugging.

tinyurl.com/vio-office

arvind
