VMware Cloud Community
maslow
Enthusiast

vRA 8: Error handling during extensibility workflows from subscriptions

Hi guys,

I have a problem with vRA 8 error handling for extensibility subscriptions ... in version 7 there was an extensibility guide which clearly stated in which cases an error in a vRO workflow would cause the deployment to fail. In 7 it was also possible to handle errors in vRO in such a way that no exception was pushed back to vRA.

Use case: we have some workflows that are not important enough that a deployment should fail when they fail. We only need a mail as notification. So I built the workflows so that if an exception occurs, they go through the default error handler, send a mail and reach a normal end element (black), NOT an exception end element (red exclamation mark).

But somehow vRA still finds out there was an exception and fails the whole deployment, so we end up with vRA cleaning up the complete deployment and listing it as failed.

How can I tell vRA that vRO workflows ending in a normal end element should not break the deployment, even if there was an exception, since the exception has already been handled ...

In vRA 7 this just worked by default ... workflows ending in "throw" or an exception end element == error reported to vRA; workflows ending in a normal end element, regardless of whether an exception occurred == no error reported to vRA (or at least it was ignored).

17 Replies
emacintosh
Hot Shot

That's interesting. In our experience with 8.2, it works just as you described. If our workflow ends normally in our default error handler, vRA considers it a success and moves on. We use that for some workflows that shouldn't fail a build (like your use cases) and for troubleshooting, to let a bad build finish so we can try to see what went wrong in the workflow.

 

maslow
Enthusiast

Hi, oh ok, really interesting 🙂

Which event topic are you using? We currently see this with workflows attached to "network.configure".

emacintosh
Hot Shot

We're subscribed to several, including that one. But in our case, if network.configure fails, we want the build to fail. If I get some time, I can try to test that; there are known ways to make that workflow fail for us, so it's something I should be able to do in our dev environment.

maslow
Enthusiast

Hm, just tried it with compute.provision.post; there it also fails, even though we handled the exception and ended in a normal end element 😞

maslow
Enthusiast

I also just tested whether it fails because the default error handler is entered. I added a normal end element, connected the red (exception) line to it and let the workflow fail. Even without entering the default error handler the workflow ends normally, but the deployment still fails ...

emacintosh
Hot Shot

I tested with network.configure and it did end up failing the build, but not because the workflow failed; it failed because the output wasn't formatted appropriately, since on failure we never get around to populating all of the various NIC arrays.

Extensibility triggered task failure: : Expected BEGIN_ARRAY but was STRING at line 1 column 14 path $.gateways

However, when we fail because of an error in the workflow, we get our expected/defined error in vRA instead.  So I do think it was trying to move along....
maslow
Enthusiast

We do get a similar error:

Extensibility triggered task failure: Extensibility error received for topic compute.removal.post, eventId = '4b30fe4f-8a57-348e-8795-d521bfc66ecc': [10030] SubscriberID: vro-gateway-xGjrV41UV3CETq84, RunnableID: cee5af62-f0b4-4920-81be-ff9d3ac3ce5b and SubscriptionID: sub_1614784854768 failed with the following error: No reply from blocking subscription Delete DNS Record(id=sub_1614784854768) for event 4b30fe4f-8a57-348e-8795-d521bfc66ecc. Expire date 2021-03-18 12:03:33.558..

 

Original Task Error: 'Extensibility triggered task failure: : Expected BEGIN_OBJECT but was STRING at line 1 column 293 path $.customProperties'

We don't return anything, as we don't need to; the workflow failed and was handled by the default error handler, which sends us a mail, and that's it. So you mean it might fail because the output properties are empty?! Not sure if that is expected behaviour ...

 

The second problem with this is that if the workflow has failed the vRA deployment, our cleanup workflows are triggered. But once we click delete deployment in vRA, the cleanup workflows get triggered again. This repeats until all cleanup workflows have terminated successfully. This can't be expected behaviour ...

emacintosh
Hot Shot

Yep, I think that is your problem. The various event topics have writable properties, and your workflows have outputs to update those properties. vRA can't read your mind as to whether you really want it to update a property or not. Because the workflow was successful and the property was output, vRA is going to try to parse/update that property.

And if the property you return is not of the correct/expected type, vRA won't be able to just figure it out. You need to send something appropriate back to the event broker if you're going to claim the workflow was successful.

I would say it's a matter of handling that in your default error handler. For example, if you're sending custom properties back, then somewhere in your handler have a scriptable task that takes inputProperties, grabs the customProperties and sends them out. That way, when the workflow is done, you'll have the same custom properties as when it started.

network.configure may be a little trickier depending on how many values you're writing back, since they're a bunch of arrays of various dimensions. But I'd say the same logic should apply - get them output in your error handler, even if they're all empty.
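A minimal sketch of such a scriptable task in the default error handler, assuming the workflow has an input named inputProperties and an output named customProperties bound to the subscription reply (the names and the fallback are just an illustration, not exact code from either of our environments):

    // Scriptable task in the default error handler (sketch).
    // IN:  inputProperties (Properties)  - payload vRA passed to the workflow
    // OUT: customProperties (Properties) - bound to the workflow output parameter
    var incoming = inputProperties ? inputProperties.get("customProperties") : null;
    if (incoming) {
        // Echo the custom properties back unchanged so the blocking reply stays well-formed.
        customProperties = incoming;
    } else {
        // Fall back to an empty Properties object instead of returning null.
        customProperties = new Properties();
    }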

I would need to check what happens if we delete a failed deployment ... I assumed it didn't do anything other than clean up the deployment; I would agree that only deletes of successful deployments should trigger cleanup tasks.

emacintosh
Hot Shot

I just deleted a failed deployment; the only topics that fired were the deployment pre/post topics, which is expected for us (we log all of that to Splunk). So it's confirmed that none of our compute.removal events kick off when deleting a failed deployment.

maslow
Enthusiast

Hm, ok, I think we use our cleanup scripts with a different event topic than you do. We use them like we did back in vRA 7.

We attached them to compute.removal.post. They fire directly when a deployment fails.

 

As I read it, you attached them to compute.removal; I can't see that topic, so I think you mean "Compute Removal", which is actually compute.removal.pre?!

Maybe that one only gets called once, while compute.removal.post gets fired once or more if an error occurs during removal.

emacintosh
Hot Shot

Sorry, I should clarify. We subscribe to compute.removal.pre (for one task) and to compute.removal.post for several other cleanup tasks. So yes, if a deployment fails, those do fire.

I thought you were saying they would fire a second time when you delete a FAILED deployment manually (like cleaning up Service Broker deployments).  And in our case, they don't.  They do of course kick off when we delete a successful deployment.

maslow
Enthusiast

Yes, that's exactly what they do; they fire twice or more. So if one of the cleanup tasks fails and you re-run delete, they get fired again and again until the deployment is deleted.

 

So yes, they fire initially when the deployment fails and runs through compute.removal.post, and then once again when we click delete deployment ...

 

 

But that hint about the empty returned properties was a good one! I added a scriptable task to the default error handler that has just one line of code, "customProperties = new Properties();", and that customProperties is bound to the output parameter. And tada ... if it fails now, the vRA deployment doesn't fail ❤️

 

Now I only need to sort out that double execution thing.

emacintosh
Hot Shot

Ah, ok. I do get a bit caught up in our own environment ... none of our cleanup subscriptions are blocking. They all just fire when they fire; they don't need to run in a particular order. And I don't think non-blocking subscriptions can fail a deployment (create or delete). So if some cleanup task fails, so be it - maybe it was expected.

That said, during the build process we keep track of various flags to know what we've done, and the corresponding cleanup workflows only do their cleanup if we created it in the first place.

For example, if we never created an AD record, we don't try to delete one. If we never created a CMDB CI, we don't try to retire it. If we didn't create a DNS record, we don't try to remove one ... and so on. We just store those flags in customProperties and update them appropriately as a build progresses; a rough sketch of that pattern is below.
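As a hypothetical example of how such a flag could be written back from a build step (the property name and parameter names are made up purely for illustration):

    // Scriptable task after a successful DNS creation step (sketch).
    // IN:  inputProperties (Properties)
    // OUT: customProperties (Properties)
    customProperties = inputProperties.get("customProperties") || new Properties();
    // Flag evaluated later by the cleanup subscriptions/workflows.
    customProperties.put("dnsRecordCreated", "true");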

 

maslow
Enthusiast

Yes, that's exactly how we do it.

We also create custom properties to track whether DNS, AD, groups, etc. were created successfully and start the cleanup workflows only if needed. We check that with a filter condition on the subscriptions that evaluates those properties (roughly like the example below).
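For reference, a hypothetical subscription filter condition for such a cleanup workflow could look like this (the property name is invented and the exact payload layout depends on the event topic, so treat it as a sketch rather than our actual condition):

    event.data.customProperties.dnsRecordCreated === "true"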

But our cleanup workflows are blocking, so vRA waits for them; they are not fire and forget.

But even if they were, getting fired twice is still not good 🙂

maslow
Enthusiast

Just verified this: if the workflow is set as blocking and you don't add an output parameter, it also works. It seems the problem only occurs when an output parameter is present in the workflow but never filled (null); then vRA fails.

If you create an empty Properties object for the output parameter it works, and it also works if you just remove the output parameter 🙂

emacintosh
Hot Shot

Glad you got that part going!

 

For the removal, I think the most curious part of your scenario is that you are running your workflows on compute.removal.post. So if a build fails and those kick off for the first time, wouldn't the server already be deleted? Even if a workflow fails, all of that should happen post removal, right? So when you later delete that failed deployment ... does it even have any compute resources in it anymore? No idea why a compute subscription would fire if there is no compute ... seems counterintuitive.

 

There is probably a way to get that "we already tried this" logic woven into your removal process - likely several ways to keep track of that info too.

maslow
Enthusiast

Hm, yes, if the deployment fails, the VM (compute) is already deleted. Our workflows fire and clean up AD, DNS, etc.

Then, if we delete the deployment, they fire again. So maybe compute.removal.post isn't the best event topic.

But even then ... the removal of the compute resource was already done, so why does this event fire again?! Doesn't make sense either.
