omrsafetyo
Enthusiast

Waiting Timer isn't resuming (perhaps after restart)

I have a vRO workflow that does a VM decommission. It powers the VM off, disables snapshots, suspends alerting, and then sits on a waiting timer for two weeks before deleting the VM from inventory, removing protection entirely, removing alerting entirely, removing DNS records, etc.

I just got a notification that a VM requested for decom two weeks ago received an alert. This is because the two-week alert suspension expired and alerting came back on, but the VM is still powered down. The decom process was supposed to resume at 10:44 EST (3:44 PM GMT), which has now passed. I confirmed that the variable being passed into the waiting timer (decomDate) is set to 10:44 local time, that it is the appropriate variable, and everything looks good. However, the workflow is not resuming. There are also 9 more decoms whose timer resume dates are in the past (5 from yesterday, plus 4 older ones).

I believe the vRO services may have been restarted since these tokens entered the waiting state, but uptime exceeds the age of the oldest submission (114 days), and the Server Restart Behavior for this workflow is set to "Resume workflow run". I haven't had issues with this in the past. I currently have 47 workflow tokens in this waiting state. This appears to be an issue on only one cluster (I have two 3-node clusters in production). I restarted the vro-service, with no change.
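
For reference, this is roughly how I'm counting the stuck tokens, from a scriptable task (a rough sketch; the exact state strings can vary by vRO version):

```javascript
// Enumerate workflow tokens and log the ones parked in a waiting state.
// "WorkflowToken" is the vRO inventory type for workflow runs; the state
// values to match against may differ between vRO versions.
var tokens = Server.findAllForType("WorkflowToken");
for (var i = 0; i < tokens.length; i++) {
    var token = tokens[i];
    if (token.state === "waiting" || token.state === "waiting-signal") {
        System.log(token.rootWorkflow.name + " | started " + token.startDate + " | state: " + token.state);
    }
}
```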

Any ideas what might be going on?
2 Replies
stevedrummond
Hot Shot
Accepted Solution

I don't know if they ever got improvements, but I believe waiting events have always been potentially problematic in clustered vRO environments, as each node operates from its own independent RabbitMQ queue.

I think you'd be better off recording the VMs/dates somewhere (in vRO or externally) and running a process on a fixed interval (e.g., once per day or once per hour) that walks the list and destroys whatever is due.
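
Something along these lines, run on a schedule (a rough sketch; the configuration element path, attribute name, and JSON shape are just placeholders):

```javascript
// Scheduled daily/hourly. Pending decoms are kept in a configuration
// element attribute ("pendingDecoms") holding a JSON array of
// { vmId: "...", decomDate: "<ISO-8601 date>" } entries.
var category = Server.getConfigurationElementCategoryWithPath("Decommission");
var element = category.configurationElements.filter(function (e) {
    return e.name === "pending";
})[0];

var pending = JSON.parse(element.getAttributeWithKey("pendingDecoms").value);
var now = new Date();
var remaining = [];

for (var i = 0; i < pending.length; i++) {
    if (new Date(pending[i].decomDate) <= now) {
        // Due: hand off to your existing destroy/cleanup workflow here.
        System.log("Decom due for VM " + pending[i].vmId);
    } else {
        remaining.push(pending[i]);
    }
}

// Write back only the entries that aren't due yet.
element.setAttributeWithKey("pendingDecoms", JSON.stringify(remaining));
```

That way nothing depends on a token surviving for two weeks; the schedule just has to fire.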

omrsafetyo
Enthusiast

Thank you for the suggestion. I ended up just going with a much shorter timer. I basically have it pause execution in 4-hour increments, check whether we are past the desired date, and resume execution once we get there. It's a lot of pause/resume cycling for what I wanted, but it seems to be working well so far. The benefit is that I don't have an active token for the entire duration like I would with a sleep, and it doesn't seem to lose track of the token with a 4-hour waiting timer, at least so far. I didn't want to externalize the process, because it's basically the entirety of the workflow; there are a lot more steps than just destroying the VM: remove from AD, remove from DNS, remove from monitoring, etc. But the explanation of the RabbitMQ queue makes sense if that's not clustered.
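
The scriptable task feeding the waiting timer looks roughly like this (names are from my workflow; treat it as a sketch):

```javascript
// Runs before the Waiting Timer element on each pass through the loop.
// Inputs:  decomDate (Date)   - the requested decommission date
// Outputs: nextWake (Date)    - bound to the Waiting Timer element
//          decomDue (boolean) - drives the decision element that either
//                               loops back to the timer or proceeds
var now = new Date();
var fourHoursMs = 4 * 60 * 60 * 1000;

if (now.getTime() >= decomDate.getTime()) {
    // Past the requested date: proceed to the cleanup steps.
    decomDue = true;
    nextWake = now;
} else {
    // Wait 4 hours (or less, if the target is closer than that), so the
    // token is never parked on a long-lived waiting event.
    decomDue = false;
    nextWake = new Date(Math.min(now.getTime() + fourHoursMs, decomDate.getTime()));
    System.log("Not due until " + decomDate + "; next check at " + nextWake);
}
```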
