JiriGlumbik
Contributor

VDP suddenly stopped scheduling backups

Hi,

I installed the VDP appliance (5.1.1), registered it with vCenter, and set up several backup jobs. The appliance worked quite reliably for two months, but then something happened and I can no longer start any job, neither scheduled nor manual. The Web GUI reports that the job was successfully scheduled, but the backup does not start. The problem has lasted two weeks.

I have restarted VDP several times and run integrity checks, but that did not help.

All appliance services are running (checked with the health check in the VDP web interface):

status2.jpg

The VDP status in the vCenter Web Client is "Normal":

status1.jpg

I have noticed that /space/avamar/var/mc/server_log/mcserver.log.0 contains a lot of repeating messages:

03/07-02:17:59.00667 com.avamar.mc.wo.JobScheduler._gotVmWork
WARNING: Backup job skipped because server is read-only
03/07-02:17:59.00667 com.avamar.mc.wo.JobScheduler.gotReadOnlyWork
FINE: bug 4261 --- checking for readonlywork
03/07-02:17:59.00667 com.avamar.mc.cr.bcConnection.sendWOReply

The timestamp of the message is clearly inside the backup window. It seems that the appliance is internally in read-only mode for some reason, but I am not able to find out why.
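
To see how often and on which days the scheduler skipped jobs, the log can be scanned for the warning quoted above. The following is a minimal sketch, assuming the log keeps the two-line layout shown in the excerpt (a timestamped header line followed by the message line); the function name `count_read_only_skips` is hypothetical and not part of VDP or Avamar.

```python
import re
from collections import Counter

# Warning text copied from the mcserver.log excerpt in this post.
READ_ONLY_WARNING = "WARNING: Backup job skipped because server is read-only"

# Header lines look like "03/07-02:17:59.00667 com.avamar....": capture MM/DD.
# Assumption: every message is preceded by a header line in this format.
HEADER = re.compile(r"^(\d{2}/\d{2})-\d{2}:\d{2}:\d{2}\.\d+\s+\S+")


def count_read_only_skips(lines):
    """Return a Counter mapping MM/DD to the number of skipped-job warnings."""
    skips = Counter()
    current_day = None
    for line in lines:
        line = line.strip()
        match = HEADER.match(line)
        if match:
            # Remember the date of the most recent header line.
            current_day = match.group(1)
        elif line == READ_ONLY_WARNING and current_day is not None:
            skips[current_day] += 1
    return skips
```

The function accepts any iterable of lines, so an open file handle on mcserver.log.0 can be passed in directly. A non-empty result for dates inside the backup window would point to the same read-only condition.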

Any ideas?

Thanks,

Jiri

12 Replies
GSparks
Enthusiast

Jiri -

Your VDP has moved to a read only state because you are reaching capacity on the appliance.

The best course of action at this point would be to submit an SR to VMware support and have them work with you to remove backups or reduce the retention policy, freeing up some space on the appliance.

-Greg Sparks
JiriGlumbik
Contributor

Greg-

Thanks for the advice.

I will try to free some space.

What is the threshold? I found 95% in the documentation.

I would welcome it if this state of the appliance were reported in the GUI.

Regards,

Jiri

GSparks
Enthusiast

Jiri -
The GUI is actually going to be updated with some new alarms in the next release that will warn users as they approach these thresholds.

With the VDP appliance, we define two different types of capacity.  One is the raw capacity of the physical disk and the other capacity is what the VDP reports (referred to as "user capacity").

When the user capacity hits 80%, the VDP continues to function, but the end user should start to look at removing older restore points from the appliance and allowing for garbage collection to free up physical disk space. 

When the user capacity hits 95%, the VDP will allow existing backups to complete, but new backup activities are suspended.

When the user capacity hits 100%, then the VDP transitions to read-only mode and no new data is allowed.
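
The three thresholds can be summarized as a small lookup. This is only a sketch of the behaviour described in this post, assuming "hits" means "greater than or equal to"; the function name `capacity_state` and the short state labels are hypothetical, not anything the appliance reports.

```python
def capacity_state(user_capacity_percent):
    """Map a user-capacity percentage to the behaviour described in the post.

    Assumption: each threshold applies from the stated value upward.
    """
    if user_capacity_percent >= 100:
        # Read-only mode: no new data is allowed.
        return "read-only"
    if user_capacity_percent >= 95:
        # Existing backups complete, new backup activities are suspended.
        return "suspended"
    if user_capacity_percent >= 80:
        # Still functional, but older restore points should be removed.
        return "warning"
    return "normal"
```

For example, `capacity_state(76)` returns `"normal"` and `capacity_state(99)` returns `"suspended"`.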

The only process that will actually free up disk space is garbage collection.  Removing restore points or reducing the retention policy allows garbage collection to begin to reduce disk usage, but if the same data is used across other backups, then the data you are expecting to be removed may actually remain.

This is why we suggest contacting support with any capacity issues, as they have tools that can help analyze the appliance to identify VMs with the highest change rates and this will help define how to reduce capacity the fastest.

-Greg Sparks
mobcdi
Enthusiast

I'm encountering the same message, but my capacity has dropped back to 76% after hitting 99% and the appliance still won't exit read-only mode.

The UI in the web client doesn't indicate that it's in read-only mode, but the mcserver.log file shows the same warning message about the server being read-only.

I've:

  • reduced the number of restore points
  • reduced the retention period for my backup jobs
  • increased the blackout period
  • rebooted the appliance after getting to 86% capacity used
  • confirmed that automatic integrity checks complete successfully and in a short period of time

Manual integrity checks can't be performed at this time, even though it's outside my blackout period.

Is there anything else I can do to get the backup jobs running again?

markko
Contributor

Same thing here. Appliance usage was 82% before the upgrade and backups were working. After the upgrade I see the same messages in the /usr/local/avamar/var/mc/server_log/mcserver.log.* files:

05/13-00:09:10.00834 com.avamar.mc.wo.JobScheduler._gotVmWork

WARNING: Backup job skipped because server is read-only

05/13-00:09:10.00834 com.avamar.mc.wo.JobScheduler._gotVmWork

WARNING: Backup job skipped because server is read-only

05/13-00:09:10.00834 com.avamar.mc.wo.JobScheduler.gotReadOnlyWork

In the /usr/local/avamar/var/vdr/server_logs/vdr-server.log I see messages which weren't there before:

2013-05-13 01:13:15,938 INFO  [Timer_general]-schedule.TaskMonitor:  Work order unkown state or queued. Task not created.

Appliance usage is 80.31% right now. Previously there was a 95% threshold at which backup activities were suspended, but now it is 80%? And I can't find any way to clear the read-only status.

mobcdi
Enthusiast

According to the admin guide if your usage is over 80% you should use the following guidelines for capacity management:

  • Stop adding new virtual machines as backup clients
  • Delete unneeded backup jobs
  • Reassess retention policies to see if you can decrease retention policies
  • Consider adding additional vSphere Data Protection (VDP) Appliances and balance backup jobs between multiple appliances

I myself had to decrease restore points to get below 80%, but I still ended up opening a case with VMware support. My appliance wasn't upgraded, so I don't want to make a direct comparison with your situation, but if you still have problems when your usage is below 80% and you have completed integrity checks, then I would recommend opening a case with VMware support to check whether it's related to an EMC issue of unacknowledged events within the Avamar software.

As a result of hitting the 80% usage limit too often for the number and frequency of backups I needed to take, I ended up deploying a larger appliance and I am keeping a closer eye on capacity usage.

markko
Contributor

I had hoped that I would still be able to run backups when the appliance starts warning about space. It's not good if VDP stops backups without any reasonable message just because used space reached 80%. I had problems earlier when the Scheduler service didn't start because used space was at 95%, but I could acknowledge the messages with "mccli event ack" and the service started again. Now even that doesn't help. At 80% the appliance isn't so full that no more backups would fit. Maybe my retention policy and VM capacity are such that space usage always stays between 80% and 85%. It should warn me about that, but not stop doing backups altogether.

But I will wait until some older backups are deleted and space is reclaimed; that should take about a day, two at most.

markko
Contributor

Space usage went below 80%, but there are still no backups and still the same errors. I deleted the appliance and redeployed a new one because I didn't have time to wait. I will wait and see whether any other appliance goes above 80% and stops doing backups. If the problem is repeatable, then I guess I must open a support case.

fcardarelli
Contributor

Hi everyone, I ran the following command in the VDP guest OS:

mccli event ack --include=22631

More detailed information in the KB:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=205123...

This resolved the issue for me.

I hope it helps.

cmartin24
Enthusiast

This worked for me.  Thank you!

I had the same problem and after manually running integrity checks and garbage collections there was no change.  The mcserver.log file noted: "WARNING: Backup job skipped because server is read-only."

AgentCK
Contributor

Not to revive an old thread, but not long ago I had the same problems. I had no experience with VDP at the beginning of my support journey, but by now I have become quite skilled at recovering VDPs.

A hard-learned lesson was to monitor the physical capacity. I therefore scheduled cron jobs to save the output of "status.dpn", "df" and "avmaint nodelist|grep fs-perc" into two logfiles. One job starts five minutes before the backup window closes, the second one before the maintenance window closes. The logfiles are then imported into Excel, so I can observe the values with and without garbage collection. But I never found a way to get the "user capacity" value from the console. (@GSparks: Any ideas?)
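
A capacity snapshot of this kind could be sketched as follows. This is not the actual cron script from this post, only a rough illustration: the three commands are the ones named above, while the function names `run_shell` and `append_snapshot`, the block layout in the logfile, and the availability of a Python 3 interpreter wherever the script runs are all assumptions.

```python
import subprocess
from datetime import datetime

# Commands named in the post; they only exist on the VDP appliance itself.
CAPACITY_COMMANDS = ["status.dpn", "df", "avmaint nodelist | grep fs-perc"]


def run_shell(command):
    """Run one command through the shell and return stdout followed by stderr."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr


def append_snapshot(logfile, commands=CAPACITY_COMMANDS, run=run_shell):
    """Append one timestamped block with each command's output to logfile."""
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(logfile, "a") as out:
        out.write(f"=== {stamp} ===\n")
        for command in commands:
            out.write(f"--- {command} ---\n")
            # Normalize trailing newlines so blocks stay evenly separated.
            out.write(run(command).rstrip("\n") + "\n")
```

Scheduling one call near the end of the backup window and another near the end of the maintenance window, each with its own logfile, would give the two series described above. The `run` parameter exists so the function can be tried out on a machine that does not have the Avamar tools installed.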

Another hard-learned lesson was that each deletion of backups and/or shortening of the retention time at first causes an increase in used space, when the checkpoint is created but before garbage collection has run. The needed capacity was often reclaimed only after some days. Before I was able to reclaim the disk space, it was often necessary to delete old checkpoints and then run GC before I created new checkpoints.

In the end I learned that it is often better to deploy a new VDP, configured the same as the broken one, run it in parallel, and delete the old VDP once its retention time has expired. Meanwhile we are running five VDPs instead of the previous two, all backing up to different storage than the machines they protect.

But the last VDP I deployed is really fu***** me up! Since the initial configuration it always throws an "[004] The VDP appliance datastore is approaching maximum capacity" alert, although the NAS has 8TB free and the VDP has never run any backup. Before opening an SR I will delete it and deploy another one. That is still faster than support, at least on a fresh non-production installation. But if someone has any ideas on this issue, I would appreciate hearing them.

NTShad0w
Enthusiast

VDP is based on Avamar, and Avamar in fact needs some events to be acknowledged before it will start some of its services, such as the backup scheduler (and dispatcher)...

So acknowledging events is one of the right answers :)
