VMware Cloud Community
cryptonym
Enthusiast
Enthusiast

Event ID 9002 in event log, SQL Server bombing on Virtual Center purges

I had many of these errors log to the event viewer.

"The transaction log for database 'VCDB' is full. To find out why space in the log cannot be reused, see the log_reuse_wait_desc column in sys.databases"

Invariably, this field said "NOTHING". However, googling this, lead me to the trail on "long running transactions".

Well, in each case what was failing was the "purge" routines - the housekeeping routines that delete old data. I dug into the SQL in the procedures, and I think it is safe to say that a different person wrote the "purge" then who wrote the "rollup" (which aggregates data in summary form). The deletes were, for some reason known only to a mad man, transactionalized. This is insane when working with millions of records, and then using "not exists" clauses to scan a lookup table on top of that. Lock escalation ensues, and after running 5 to 7 minutes, the system is exhausted and fails. The rollup code, while not perfect, has much fewer problems, and since it uses cursors to access the data, it seems very efficient.

FYI, I had about 1.5 million records and 15000 time slots active, so do the math on how much work they are running in a single transaction (1,500,000 x 15,000). WOWSIE!

The person who wrote the purge routines also introduced a bug in that they evaluate the current system time to calculate whether an item is old. They do this for every record, and in every "not exists" subselection. This is NOT how to do subselects, any DBA can tell you that you don't do unneeded calculations in the inner loop. Instead, you calculate the value once, then just use that value each time. Additionally, this coder created a "sliding window", in that as the process runs (especially in the case of the 5-7 minute runs I seen) more and more records come into scope. Using a "calculate once and forget it" value sets us up to delete from a fixed point back.

Since they are simple house keeping routines, and since they only look at a point in time and look backwards, they will always work. If one fails, then on the next run it will purge what was missed. What the transactions were forcing to occur, and for no reason I can see, is that if any one record failed to be deleted (i.e. deadlock, in use, etc), then potentially hundreds of thousands of records that were deleted would be undeleted, thus chewing even more CPU up. And, once failures started, almost all subsequent runs would fail, since each run had to try to remove all those that were rolled back in on the last failure.

I removed the transaction blocks, and also parameterized the points where they called dateadd(...). Net result is that instead of running for 7 minutes and bombing, it runs in a few seconds and actually works.

I'm sending this up. Hopefully, there coders will look into this. If they come back and say that I am incorrect on this, then I'll update this thread.

So, here are the replacement scripts I came up with. Just run them into your VCDB, or have your DBA do it. If you choose to try this, do so at your own risk, but it works for me.

set ANSI_NULLS ON

set QUOTED_IDENTIFIER ON

GO

ALTER procedure [[dbo]].[[purge_stat1_proc]] AS

SET NOCOUNT ON

declare @v_purge_interval int

declare @v_purge_dt datetime

begin

select @v_purge_interval=interval_length

from vpx_STAT_interval_def WHERE INTERVAL_SEQ_NUM=1

select @v_purge_dt=dateadd(second,-@v_purge_interval,getutcdate())

delete from vpx_sample_time1

WHERE SAMPLE_TIME<=@v_purge_dt

AND ROLLUP_COUNTER IS NOT NULL

delete from VPX_HIST_STAT1

WHERE NOT EXISTS(SELECT 1 FROM vpx_sample_time1 VTD

WHERE vtd.TIME_id=VPX_HIST_STAT1.time_id)

delete from VPX_HIST_STAT1

WHERE NOT EXISTS(SELECT 1 FROM vpx_stat_counter TMP

WHERE TMP.counter_id=VPX_HIST_STAT1.counter_id)

end

GO

set ANSI_NULLS ON

set QUOTED_IDENTIFIER ON

GO

ALTER procedure [[dbo]].[[purge_stat2_proc]] AS

SET NOCOUNT ON

declare @v_purge_interval int

declare @v_purge_dt datetime

begin

select @v_purge_interval=interval_length

from vpx_STAT_interval_def WHERE INTERVAL_SEQ_NUM=2

select @v_purge_dt=dateadd(second,-@v_purge_interval,getutcdate())

delete from vpx_sample_time2

WHERE SAMPLE_TIME<=@v_purge_dt

AND ROLLUP_COUNTER IS NOT NULL

delete from VPX_HIST_STAT2

WHERE NOT EXISTS(SELECT 1 FROM vpx_sample_time2 VTD

WHERE vtd.TIME_id=VPX_HIST_STAT2.time_id)

delete from VPX_HIST_STAT2

WHERE NOT EXISTS(SELECT 1 FROM vpx_stat_counter TMP

WHERE TMP.counter_id=VPX_HIST_STAT2.counter_id)

end

GO

set ANSI_NULLS ON

set QUOTED_IDENTIFIER ON

GO

ALTER procedure [[dbo]].[[purge_stat3_proc]] AS

SET NOCOUNT ON

declare @v_purge_interval int declare @v_fourth_purge_interval int

declare @v_purge_dt datetime declare @v_fourth_purge_dt datetime

begin

select @v_purge_interval=interval_length

from vpx_STAT_interval_def WHERE INTERVAL_SEQ_NUM=3

select @v_fourth_purge_interval=interval_length

from vpx_STAT_interval_def WHERE INTERVAL_SEQ_NUM=4

select @v_purge_dt=dateadd(second,-@v_purge_interval,getutcdate())

select @v_fourth_purge_dt=dateadd(second,-@v_fourth_purge_interval,getutcdate())

delete from vpx_sample_time3

WHERE SAMPLE_TIME<=@v_purge_dt

AND ROLLUP_COUNTER IS NOT NULL

delete from VPX_HIST_STAT3

WHERE NOT EXISTS(SELECT 1 FROM vpx_sample_time3 VTD

WHERE vtd.TIME_id=VPX_HIST_STAT3.time_id)

delete from VPX_HIST_STAT3

WHERE NOT EXISTS(SELECT 1 FROM vpx_stat_counter TMP

WHERE TMP.counter_id=VPX_HIST_STAT3.counter_id)

delete from vpx_sample_time4

WHERE SAMPLE_TIME<=@v_fourth_purge_dt

AND ROLLUP_COUNTER IS NOT NULL

delete from VPX_HIST_STAT4

WHERE NOT EXISTS(SELECT 1 FROM vpx_sample_time4 VTD

WHERE vtd.TIME_id=VPX_HIST_STAT4.time_id)

delete from VPX_HIST_STAT4

WHERE NOT EXISTS(SELECT 1 FROM vpx_stat_counter TMP

WHERE TMP.counter_id=VPX_HIST_STAT4.counter_id)

end

0 Kudos
1 Reply
lschweiss
Contributor
Contributor

This looks like the most elegant solution out there, but there has been no feedback of its success or failure with others.

It seems VMware has not incorporated this or any other clean up of this SQL code.    I'm on vSphere 5.0 U1 and the problems seem to be just as bad.

Any one know if this patch works under 5.0?

0 Kudos