omrsafetyo
Enthusiast

Waiting Timer isn't resuming (perhaps after restart)

I have a vRO workflow that does a VM decommission. It powers the VM off, disables snapshots, suspends alerting, and then sits on a waiting timer for two weeks before deleting the VM from inventory, removing protection entirely, removing alerting entirely, removing DNS records, etc.

I just got a notification that a VM requested for decom two weeks ago received an alert. This is because the two-week alert suspension expired and alerting came back on, but the VM is still powered down. The decom process was supposed to resume at 10:44 EST (3:44 PM GMT), which has now passed. I confirmed that the variable being passed into the waiting timer (decomDate) is set to 10:44 local time, that it is the appropriate variable, and everything looks good. However, the workflow is not resuming. There are also 9 more decoms whose timer resume dates are in the past (5 from yesterday, plus 4 older ones).

I believe the vRO services may have been restarted since these tokens entered the waiting state, but uptime exceeds the age of the oldest submission (114 days), and the Server Restart Behavior for this workflow is set to "Resume workflow run". I haven't had issues with this in the past. I currently have 47 workflow tokens in this waiting state. This appears to be an issue on only one cluster (I have two 3-node clusters in production). I restarted the vro-service, with no change.
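
For reference, this is roughly how I'm counting the stuck tokens, from a scriptable task (a rough sketch; the exact state strings can vary by vRO version):

```javascript
// Enumerate workflow tokens and log the ones parked in a waiting state.
// "WorkflowToken" is the vRO inventory type for workflow runs; the state
// values to match against may differ between vRO versions.
var tokens = Server.findAllForType("WorkflowToken");
for (var i = 0; i < tokens.length; i++) {
    var token = tokens[i];
    if (token.state === "waiting" || token.state === "waiting-signal") {
        System.log(token.rootWorkflow.name + " | started " + token.startDate + " | state: " + token.state);
    }
}
```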

Any ideas what might be going on?
2 Replies
stevedrummond
Hot Shot
Accepted Solution

I don't know if they ever got improvements, but I believe waiting events have always been potentially problematic in clustered vRO environments, as each node operates from its own independent RabbitMQ queue.

I think you'd be better off recording the VMs/dates somewhere (in vRO or externally) and running a process on a fixed interval (e.g., once per day or once per hour) that walks the list and destroys whatever is due.
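
Something along these lines, run on a schedule (a rough sketch; the configuration element path, attribute name, and JSON shape are just placeholders):

```javascript
// Scheduled daily/hourly. Pending decoms are kept in a configuration
// element attribute ("pendingDecoms") holding a JSON array of
// { vmId: "...", decomDate: "<ISO-8601 date>" } entries.
var category = Server.getConfigurationElementCategoryWithPath("Decommission");
var element = category.configurationElements.filter(function (e) {
    return e.name === "pending";
})[0];

var pending = JSON.parse(element.getAttributeWithKey("pendingDecoms").value);
var now = new Date();
var remaining = [];

for (var i = 0; i < pending.length; i++) {
    if (new Date(pending[i].decomDate) <= now) {
        // Due: hand off to your existing destroy/cleanup workflow here.
        System.log("Decom due for VM " + pending[i].vmId);
    } else {
        remaining.push(pending[i]);
    }
}

// Write back only the entries that aren't due yet.
element.setAttributeWithKey("pendingDecoms", JSON.stringify(remaining));
```

That way nothing depends on a token surviving for two weeks; the schedule just has to fire.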

omrsafetyo
Enthusiast

Thank you for the suggestion. I ended up just going with a much shorter timer. I basically have it pause execution in 4-hour increments, check whether we are past the desired date, and resume execution once we get there. It's a lot of pause/resume cycling for what I wanted, but it seems to be working well so far. The benefit is that I don't have an active token for the entire duration like I would with a sleep, and it doesn't seem to lose track of the token with a 4-hour waiting timer, at least so far. I didn't want to externalize the process, because it's basically the entirety of the workflow; there are a lot more steps than just destroying the VM: remove from AD, remove from DNS, remove from monitoring, etc. But the explanation of the RabbitMQ queue makes sense if that's not clustered.
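
The scriptable task feeding the waiting timer looks roughly like this (names are from my workflow; treat it as a sketch):

```javascript
// Runs before the Waiting Timer element on each pass through the loop.
// Inputs:  decomDate (Date)   - the requested decommission date
// Outputs: nextWake (Date)    - bound to the Waiting Timer element
//          decomDue (boolean) - drives the decision element that either
//                               loops back to the timer or proceeds
var now = new Date();
var fourHoursMs = 4 * 60 * 60 * 1000;

if (now.getTime() >= decomDate.getTime()) {
    // Past the requested date: proceed to the cleanup steps.
    decomDue = true;
    nextWake = now;
} else {
    // Wait 4 hours (or less, if the target is closer than that), so the
    // token is never parked on a long-lived waiting event.
    decomDue = false;
    nextWake = new Date(Math.min(now.getTime() + fourHoursMs, decomDate.getTime()));
    System.log("Not due until " + decomDate + "; next check at " + nextWake);
}
```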
