VMware Cloud Community
rkrichevskiy
Enthusiast

Odd deadlock errors

Has anyone seen this type of update error (stack trace attached)? I am getting it on one of our systems (v5.5.3) with an MSSQL database:

2016-03-03 11:54:25.260-0500 [8a81abd7533d613901533d66b02f009e] WARN  {USERID:WORKFLOWNAME:7da698be-aea1-4f6d-8793-02e88559c135:8a81abd7533d613901533d66a4f7006e} [WorkflowItemSubElementsRunner$SubWorkflowHandlerImpl] Unable to update a sub workflow token / content

ch.dunes.util.DunesServerException: org.springframework.dao.CannotAcquireLockException: could not execute batch; SQL [update vmo_subworkflowtoken set globalstate=?, itemname=?, parentwftokenid=?, runningpassword=?, runningusername=?, serveruri=?, workflowid=?, workflowtokenid=? where id=?]; nested exception is org.hibernate.exception.LockAcquisitionException: could not execute batch

It seems to be intermittent on a couple of workflows that use a locking mechanism and nested sub-workflow calls. It results in the outer workflow hanging indefinitely. ALLOW_SNAPSHOT_ISOLATION and READ_COMMITTED_SNAPSHOT are set on the database.
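For reference, those two settings can be confirmed straight from `sys.databases` with a query like the one below (a sketch; `vmo` is a placeholder for your actual vRO database name):

```sql
-- Check whether snapshot-based isolation is enabled for the vRO database.
-- 'vmo' is a hypothetical database name; substitute your own.
SELECT name,
       snapshot_isolation_state_desc,   -- reflects ALLOW_SNAPSHOT_ISOLATION
       is_read_committed_snapshot_on    -- reflects READ_COMMITTED_SNAPSHOT
FROM sys.databases
WHERE name = 'vmo';
```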

I have a support case open, but I hope someone has run into this type of error and can point me in the right direction while I am waiting for support to complete their investigation.

11 Replies
igaydajiev
VMware Employee

In the past there were a few reports regarding deadlock issues when using an MSSQL database. They should be resolved in vRO 5.5.3.

This particular one looks related to nested workflow execution (inside workflow 7da698be-aea1-4f6d-8793-02e88559c135:8a81abd7533d613901533d66a4f7006e).

Could you provide the whole server log file, if it is not confidential?

Please also open a tracking case with VMware and we will take a look.


PS: We have no such reports when using a Postgres DB. If this is an option for you, I can suggest switching to Postgres.


rkrichevskiy
Enthusiast

Yes, I recall dealing with the purge scheduler in 5.1.*, and the current situation sounds really similar; but digging through the logs I can see that the token purge executes on the hour, whereas the current deadlock can occur at any random time. The weird thing is that we have numerous instances running v5.5.3 on MSSQL and only one instance exhibiting this behavior on a fairly consistent basis.

I have a support case (16906856003) already in progress with a full log bundle submitted, just on standby for review.

Changing the DB may not be a good approach for us at this time, but I will suggest it to my team. Would you recommend running appliance Orchestrator instances with local PostgreSQL? Currently we run standalone Orchestrator instances with databases on a separate VM.

igaydajiev
VMware Employee

>Yes, I recall dealing with purge scheduler in 5.1

I am actually referring to exactly this issue. The difference I see is that before, the lock was happening between the workflow token and the purging job for completed workflow tokens.

The current exception points to sub-workflow tokens (which are triggered using the 'Nested workflow' element from the palette), so it is rather a shot in the dark currently.

I will take a look at the provided log files and try to find out who is holding the lock. Meanwhile, it would help if you could also investigate on the DB side which queries are locked when the issue reappears, and attach the mentioned workflow.
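For the DB-side check, one possible starting point is SQL Server's dynamic management views, which show blocked sessions and the statement each one is running (a sketch; it requires VIEW SERVER STATE permission and only catches live blocking, not deadlocks that have already been resolved by the engine):

```sql
-- List sessions that are currently blocked, who is blocking them,
-- and the SQL text of the blocked request.
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,        -- milliseconds spent waiting so far
       t.text AS running_sql
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE r.blocking_session_id <> 0;
```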


>Changing the DB may not be a good approach for us at this time, but I will suggest it to my team. Would you recommend running appliance Orchestrator instances with local PostgreSQL? Currently we run standalone Orchestrator instances with databases on a separate VM.

It depends on the load. Using the embedded Postgres means the DB shares resources with the vRO server itself, so for larger loads a dedicated machine for the DB is better.

powerbuck
Contributor

What was the resolution to your case? Have you fixed your issue?

We are seeing the exact same deadlock issues intermittently, running 5.5.2 with an MSSQL DB.

Thanks.

rkrichevskiy
Enthusiast

Sorry, no concrete updates as of yet. Tier 1 got my log bundle and sent it off to the developers; that's the last I heard from support. The last time I went through a somewhat similar deadlock situation it took months to diagnose and get a hotfix, so I am not holding out for this to be resolved any time soon.

In our environment I have a redundant set of Orchestrators, and I am basically using them to handle the tasks that tend to produce deadlocks (which appear as hung workflows from the user's perspective). Mind you, the redundant instances are the same build as the instance experiencing the issue, so I am hoping this is something with this one database rather than an application issue.

I am considering not using MSSQL DBs in future builds, but unfortunately switching is not an option for us right now.

igaydajiev
VMware Employee

Just got to the case. :smileyhappy:

I have reviewed the provided logs, and it seems that this time the issue is different and is related to the usage of the "Nested workflow" element. While we are investigating it, you could try, as a workaround, replacing "Nested workflow" elements with "Asynchronous workflow" execution.

@powerbuck

As pointed out in earlier comments, in versions prior to 5.5.3 there were deadlock issues related to the purging of completed workflow tokens. This issue was resolved, and the fix is available in vRO 5.5.3.

You can try upgrading to vRO 5.5.3 (or newer; the latest version is 7) and verify whether the issue is still reproducible.

rkrichevskiy
Enthusiast

Glad it landed in capable hands! Yes, from my observation this particular deadlock seems to occur only when a token is returned from the execution of an inner nested workflow; the outer workflow gets hung up. We use quite a few nested calls, and not all of them are affected. It could be related to the number of workflows we execute within a nested call.

I am looking forward to discussing this further after your investigation.

Thanks.

igaydajiev
VMware Employee

Checking the provided log files, it seems it always errors on the same workflow:

Unregister Machine - V2:7da698be-aea1-4f6d-8793-02e88559c135:8a81abd7533d613901533d6853f9042a.

It would be beneficial if you could attach it to the case.

rkrichevskiy
Enthusiast

Yes, this is the one I see it on the most. Is it possible to get a WebEx scheduled so I can demo it? I won't be able to export a package due to an export issue (see case 16862656501), and due to the number of workflow components it's not feasible to export them individually.

rkrichevskiy
Enthusiast

I wanted to follow up with a potential workaround, as suggested by the VMware engineer: we are restructuring our workflows to use asynchronous calls rather than nested ones. So far this approach has not triggered deadlock errors against our databases. We still get the benefit of running inner workflows in parallel; a small drawback is that the user needs to follow the async workflow execution in the client GUI if it is used for monitoring activities.

igaydajiev
VMware Employee

Thanks for the feedback!

We tried to reproduce the issue in our environment but were not able to. We have provided an update in the related case and will probably request another troubleshooting session.
