Past Day stats rollupVirtualCenter SQL job failing

jparnell · ‎04-01-2009

Hi,

Since upgrading to vCenter 2.5 U4, we keep having problems with the Past Day stats rollupVirtualCenter SQL job. It keeps failing with the following error:

Log Job History (Past Day stats rollupVirtualCenter)
Step ID 1
Server xxxx
Job Name Past Day stats rollupVirtualCenter
Step Name Past day stats rollup
Duration 00:00:16
Sql Severity 13
Sql Message ID 1205
Operator Emailed
Operator Net sent
Operator Paged
Retries Attempted 0
Message
Executed as user: VirtualCenterDBO. Transaction (Process ID 89) was deadlocked on lock | communication buffer resources with another process and has been chosen as the deadlock victim. Rerun the transaction. (Error 1205). The step failed.

Sometimes, the vCenter service crashes, with error:

An unrecoverable problem has occurred, stopping the VMware VirtualCenter service. Check database connectivity before restarting. Error: Error[VdbODBCError] (-1) "ODBC error: (HY000) - [SQL Native Client][SQL Server]The instance of the SQL Server Database Engine cannot obtain a LOCK resource at this time. Rerun your statement when there are fewer active users. Ask the database administrator to check the lock and memory configuration for this instance, or to check for long-running transactions." is returned when executing SQL statement "UPDATE VPX_ENTITY SET NAME = ? ,

Is anyone else having this problem?

James

MvdV · ‎04-09-2009

We experience exactly the same deadlock problems on the Past Day stats rollup job. About 50% of the time the job fails after being chosen as a deadlock victim. The same error messages appear in the SQL error log. We also see the VirtualCenter service crash frequently.

ian4563 · ‎04-09-2009

Yes, I'm having the same SQL locking problems and infrastructure client crashes after upgrading to VC U4. Do you guys have your statistics set to non-default settings?

WindForce · ‎04-13-2009

Interestingly enough, we have the same problem here after updating to Update 4. I hope someone can shed some light on a possible fix.

VirtualNoitall · ‎04-13-2009

Hello,

We ran into this exact same issue and ended up moving the DB to a separate server. For this VirtualCenter instance we had VC and SQL 2005 on the same server. The issue is a shortage of memory resources for SQL Server. I know there was a logic update in U4 to deal with the rollup jobs, and it looks to have put more strain on the SQL server. We had VC crash 3 times in a week, and since the DB moved 8 days ago we have not had a repeat.

The Daily Rollup Job failing is a pain as well. We originally updated to U4 as it was supposed to solve the issue. We found that it didn't and have since updated the rollup jobs to retry up to 2 times in the event of a failure. This has reduced our failures to 0 but also allows us to track retries so we can continue to try to resolve the actual problem and not just cover it up.

Hope this helps
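The retry approach described above can be sketched in a few lines. This is only an illustration of the logic (retry a deadlocked step a bounded number of times while still counting the retries), not the actual job configuration used in this thread. `DeadlockError`, `run_with_retries`, and `delay_seconds` are hypothetical names made up for the example.

```python
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for SQL Server error 1205 (deadlock victim)."""


def run_with_retries(step, max_retries=2, delay_seconds=0.0):
    """Run step(); if it raises DeadlockError, retry up to max_retries times.

    Returns (result, retries_used) so that retries remain visible and the
    underlying deadlock problem is tracked rather than hidden.
    """
    retries_used = 0
    while True:
        try:
            return step(), retries_used
        except DeadlockError:
            if retries_used >= max_retries:
                raise  # out of retries: surface the failure to the caller
            retries_used += 1
            time.sleep(delay_seconds)  # optional pause before the next attempt
```

Returning the retry count alongside the result mirrors the point made above: the retries reduce visible failures, but the count should still be logged so the real cause can be chased.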

jmcdonald1 · ‎04-15-2009

Hey All,

Unfortunately, the rollup deadlock problems were not completely fixed in U4. Another fix is going into a future release, which should help prevent the deadlocks from occurring.

In the meantime, you can implement the following KB, which has new rollup jobs attached to it:

Cheers,

/Jonathan

WindForce · ‎04-15-2009

Thank you Jonathan. The updated stored procedures seem to have resolved the deadlock problem.

On a side note, the instructions in the KB article do not take into account that the procedures already exist. As such, I had to modify the new procedure code to read ALTER instead of CREATE.
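For anyone who prefers to script that change rather than edit each file by hand, here is a minimal sketch. It assumes the script text contains plain `CREATE PROCEDURE` (or `CREATE PROC`) statements; `create_to_alter` is a hypothetical helper name, and the procedure name in the usage line is only a placeholder. The substitution is naive and would also rewrite matching text inside comments or string literals, so review the result before running it.

```python
import re

# Matches CREATE followed by PROC or PROCEDURE as whole words, any case.
_CREATE_PROC = re.compile(r"\bCREATE(\s+PROC(?:EDURE)?\b)", re.IGNORECASE)


def create_to_alter(sql_text):
    """Return sql_text with CREATE PROC[EDURE] rewritten to ALTER PROC[EDURE].

    Other CREATE statements (tables, indexes, ...) are left untouched.
    """
    return _CREATE_PROC.sub(r"ALTER\1", sql_text)
```

Usage: `create_to_alter("CREATE PROCEDURE purge_stat1_proc AS ...")` yields the same text starting with `ALTER PROCEDURE`. Note that ALTER only works when the procedure already exists, which is the situation described above.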

ian4563 · ‎04-15-2009

I just deleted the three existing stored procedures and ran the three queries. Hope that won't cause any problems....

WindForce · ‎04-15-2009

I am sure it won't cause any problems at all. I posted feedback on the KB article saying they should have the code read ALTER PROCEDURE instead of CREATE PROCEDURE. That should solve the problem for others following the KB article.

Thanks for the link to the KB article. Very helpful.

jmcdonald1 · ‎04-15-2009

Glad to hear it. From what I have seen in the scripts, they usually check whether the procedure exists and then recreate it if needed. Nonetheless, I will investigate this and make sure that the appropriate instructions are in the KB. As long as you are running the new code, that is what matters.

VirtualNoitall · ‎04-15-2009

We applied this fix a couple of days ago now and it was better for a number of consecutive runs but then the deadlocks returned. I would be interested to hear what people see after a couple of days.

jmcdonald1 · ‎04-15-2009

AFAIK, it shouldn't, as all that is happening is that we are recreating the scripts with updated code.

Andulla · ‎04-16-2009

We applied the fix yesterday and it was ok till today. The deadlocks returned...:(

jparnell · ‎04-16-2009

Likewise. Although we haven't had the vCenter service crash yet, the rollup job is still failing because of deadlocks.

jmcdonald1 · ‎04-17-2009

I have seen several cases reporting the jobs still failing after about a day when using the new scripts. I have reported this behavior to the appropriate people internally.

BigHug · ‎04-17-2009

Hi, Jonathan:

We have the same problem. We applied the KB fix, but it doesn't help. The KB updates the purge procs, which are step 2 in the job, but the deadlock occurred during step 1. The proc is stats_rollup1_proc. Maybe you can ask the appropriate people to create new rollup procs, just like the purge procs in the KB. Thanks.

Andulla · ‎04-28-2009

Hi Jonathan,

This is correct! The deadlocks occur during step 1 (past day (or week) stats rollup).

Do you have any updated purge_stat(1-3)_proc_mssql.sql procedures?

jmcdonald1 · ‎04-29-2009

The rollup procedures work together with one another. The KB mentioned will not completely prevent the deadlocks (as many have expressed already); it is meant to help reduce the occurrences.

Let me explain a little further so that everyone understands what is happening. These deadlocks are caused by the large amount of activity (selects, etc.) generated by the vCenter process, combined with the delete/purge/rollup operations, during regular vCenter operation. Therefore, the larger the environment, the more vCenter interaction there is, and the higher the stats levels are set, the more frequently the deadlocks can occur.

Are they harmful? Not really (in my opinion), as long as the jobs do run. Most of the cases that I have personally seen show that the stats1 rollup job will fail at most a few times a day but will run successfully the majority of the time. As long as a run eventually succeeds, the unprocessed data in the stat1 table will be rolled up as expected. Of course, the downside of a failure is that the latest stats information will not show until the next successful run (usually 30 minutes later, as the stats1 procedure runs once every 30 minutes).

VMware is definitely aware that people are seeing this and is working on the problem. Before anyone asks, I have no timeframe, as there is a substantial amount of testing involved in any changes. Definitely keep an eye on the release notes.

PS: One other point I wanted to mention is that apparently (from what I have personally heard) in vSphere 4 the chances of this type of deadlock occurring are further reduced (hopefully eliminated completely!) due to some architecture changes. :)

Cheers,

/Jonathan

peetz · ‎04-30-2009

Hi jmcdonald,

we are also affected by this problem and have already implemented the updated SQL-scripts from the knowledge base.

Yesterday we increased the VC stats level for the 5-minute-interval from the default 1 to 3. After reading your explanation I expected the rollup-job failure to occur more frequently after this change. However, something very interesting happened: Since we did this change we have NOT experienced the failure anymore. It's 24 hours now since we did the change. Before we had about 10 failures a day.

Andreas


LeeCarey · ‎07-14-2009

Hi, we are having the same issue with the daily rollup job. It only occurred after Update 4. When the job runs, the VI Client is unusable, and after running for 60 hours over the weekend it failed, reporting the deadlock. I have implemented the necessary KB fixes, but still with no joy. Does anyone have any other suggestions? The DB is obviously starting to get bigger with the jobs not running, and any perf data prior to the previous day cannot be viewed.

Thanks in advance

L