We have the same issue, as the procedures haven't been running for a couple of months as we were getting LOCK errors. This was fixed by upping the SQL LOCK limit to 100,000, but the Past Day stats rollupvCenter job on our 20GB vCenter DB has been running for one day so far. I may cancel the job and change the parameters in http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=3034858 to see if that helps.
We have 14 ESX hosts, 180 VMs, SQL 2005 VC with 20GB db.
The length of time completely depends on the amount of data in the database. Usually I would start by running the following query to check the number of rows in the table:
select count(*) from vpx_hist_stat1
If there is over 10 million rows it could take awhile to run. Heck, I saw a case one time where there was 4 billion rows in the database, which, at that point it was impossible to be able to catch up due to to the amount of new information that was coming in. After this, with vCenter 2.5 you can run:
select max(sample_time) from vpx_sample_time2
This query will show you the last sample time that was rolled up into the database. This is good, because because it is dynamic with the running of the script. You can therefore run this occasionally while the job is running and it will tell you the latest information that has been processed. You can then make a pretty good judgement on how long it will take to complete the rollup. Keep in mind that it will have to run a complete cycle before things will be back to normal. If it is cancelled it wil have to start over again.
I am actually doing a VMworld presentation in september on a technical deep dive for the databse. It has a multitude of information on this in it.
I did the upgrade from 2.5 to vSphere last Thursday and after that our transaction log started growing out of control even though the DB was set to "Simple", our VC DB is approximately 140GB, but the log grew to over 250GB! It took me a while but I determined that it's the purge_stat1_proc stored procedure that's never ending. I had to kill the job after it ran for over 24 hours because we were about to run out of disk space on the log lun. After I killed it, I can shrink the log, but I can't get the purge to complete. I did a select count(*) from vpx_hist_stat1 and we have over 1.1 billion rows in the table.
Any suggestions for how to clear this up?
Wow...that takes my second place as to the largest vpx_hist_stat1 i have ever seen...largest was 10Billion. Definitely sounds like the rollup jobs have not been running for quite some time. Unfortuantely the only way to keep the data is if they process the data. This could take some time, and will require a lot of diskspace, as the queries were not designed to process this much information all at the same time. The queries are meant to process every 30mins, 2hrs, and 24hrs, so that we never get htis much data in the tables. This is not to say that they won't work though.
This said, realistically from what I have experienced, it is unlikely that they will ever catch up due to the amount of new information coming into the database. You can check the date of the last sample interval that has been processed by running the following query:
select max(sample_time) from vpx_sample_time2
i am assuming that this will likely be several months ago. Also, this is good because you can watch the progress of how long it is taking to rollup...so if you start the job run the query a few times over a couple of hours, and see how quickly it is processing. By doing this you can make a determination of how logn it will take to complete the rollup without taking into account new data. If you see that you are chasing your tail, ie the amount of data coming in is greater than the amount of time it is taking to process the existing data, than my suggestion is to truncate the hist_stat1 table, since it will take a heck of alot less time than actually running a purge from the kb1000125 article:
truncate from vpx_sample_time1
NOTE: This will remove all unprocessed data that has not been processed by the daily rollup job.
Thanks for the info on this. I just ran into this problem and had 0.8 billion rows. Unsurprisingly a truncate was required as there was no way the rollup job could keep up.
We upgraded our 2.5 to 4 back in November and the backend is Oracle 10. Nobody noticed that the performance stats weren't working until a couple days ago when I was probing around looking for disk metrics. Is it a bug in the upgrade process that does not re-create the the purge_stat1-2-3_proc jobs? Seems like this is a problem that should be tracked as a bug, no?
Anyhow, our DBA kicked off the 1st purge script yesterday and today it is still running but is now only about 8 days back so should be done later today. I'd like to keep as much of the perf data as possible so when the other 2 purge jobs run, will the size of the DB shrink back down or does it need to be pruned manually? I think he extended it a couple times (we hit the wall early on and bumped it out to 11GB). Is there an article or set of steps to go through once the jobs have all caught up and we are on track to make the DB a bit more tiny?