Used capacity on one of our VDP appliances reached 96.46% and the Backup Scheduler was stopped. I'm able to start the Backup Scheduler again, but although I deleted many backups, the used capacity is still the same. Is there any way to free up space, or would I be better off rolling back to the last validated checkpoint?
See if you can delete some of your older backups and give the maintenance window enough time for garbage collection to clean up.
Note -
If you are using VDP 5.1, then temporarily increase the blackout window, which is when garbage collection (cleanup of deleted backups from storage) runs.
If you are using VDP 5.5, then just increase the maintenance window temporarily.
In the longer term you may have to rework your retention policies so that you don't end up with too many backups.
Thanks for the answer.
I'm using VDP 5.1.10.32, so I've increased the blackout window. Must I wait for the next blackout window to begin, or can I trigger garbage collection manually? I can see that it failed this morning:
505923 2013-12-05 08:15:06 CET ERROR 4202 SYSTEM PROCESS / failed garbage collection with error MSG_ERR_DISKFULL
I'm not sure whether the "ConnectEMC is not running." error is related to it too. Can I try to start it by running "dpnctl start mcs"? I'm not very familiar with Avamar.
Check reply-6 in this thread - https://community.emc.com/thread/116610
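On the Avamar back end you can usually start a GC pass by hand instead of waiting for the blackout window. The following is a hedged sketch based on general Avamar CLI usage, not VDP-specific documentation; verify the flags (e.g. the `--timeout` value, in seconds) with the `avmaint` man page on your appliance before running anything:

```shell
# Assumed invocation -- verify flags on your appliance before running.
# Trigger a garbage-collection pass manually (timeout in seconds):
avmaint garbagecollect --timeout=3600 --ava

# Check whether GC is running and see the result of the last pass:
avmaint gcstatus --ava
```

Note that GC still refuses to start while used capacity is above the disknogc threshold, so at 96% a manual run is likely to fail with the same MSG_ERR_DISKFULL as the scheduled one.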
Thanks again, but it looks like I was caught in a trap.
Checkpoints are valid, but old:
root@vm-vdp:~/#: cplist
cp.20131202100237 Mon Dec 2 11:02:37 2013 valid rol --- nodes 1/1 stripes 1916
cp.20131202103632 Mon Dec 2 11:36:32 2013 valid rol --- nodes 1/1 stripes 1916
No hfscheck yet (since last reboot?) and gsan status is degraded:
root@vm-vdp:~/#: status.dpn|less
Čt pro 5 15:26:42 CET 2013 [vm-vdp.racom.cz] Thu Dec 5 14:26:42 2013 UTC (Initialized Wed Nov
7 19:55:37 2012 UTC)
Node IP Address Version State Runlevel Srvr+Root+User Dis Suspend Load UsedMB Errlen
%Full Percent Full and Stripe Status by Disk
0.0 192.168.20.17 6.1.81-130 ONLINE fullaccess mhpu+0hpu+0000 2 false 0.28 3594 27424967
62.6% 62%(onl:644) 62%(onl:648) 62%(onl:642)
Srvr+Root+User Modes = migrate + hfswriteable + persistwriteable + useraccntwriteable
All reported states=(ONLINE), runlevels=(fullaccess), modes=(mhpu+0hpu+0000)
System-Status: ok
Access-Status: admin
No checkpoint yet
No GC yet
No hfscheck yet
Maintenance windows scheduler capacity profile is active.
WARNING: Scheduler is WAITING TO START until Fri Dec 6 08:00:00 2013 CET.
Next backup window start time: Fri Dec 6 20:00:00 2013 CET
Next blackout window start time: Fri Dec 6 08:00:00 2013 CET
Next maintenance window start time: Fri Dec 6 16:00:00 2013 CET
root@vm-vdp:~/#: dpnctl status
Identity added: /home/dpn/.ssh/dpnid (/home/dpn/.ssh/dpnid)
dpnctl: INFO: gsan status: degraded
dpnctl: INFO: MCS status: up.
dpnctl: INFO: Backup scheduler status: down.
dpnctl: INFO: axionfs status: up.
dpnctl: INFO: Maintenance windows scheduler status: enabled.
dpnctl: INFO: Unattended startup status: enabled.
I've tried to get GC into an active state following reply-5, but it looks like the used capacity of 96.4% is too high. I suppose GC will not start in the morning, am I right?
root@vm-vdp:~/#: avmaint config --ava | grep disk
disknocreate="90"
disknocp="96"
disknogc="85"
disknoflush="94"
diskwarning="50"
diskreadonly="65"
disknormaldelta="2"
freespaceunbalancedisk0="30"
diskfull="30"
diskfulldelta="5"
balancelocaldisks="true"
root@vm-vdp:~/#: avmaint config disknogc=97 --ava
2013/12/05-13:55:10.94029 [avmaint] ERROR: <0949> Command failed because these config values do not meet the following criteria:
2013/12/05-13:55:10.94040 [avmaint] ERROR: <0001> 0 < diskwarning(50) < diskreadonly(65) < disknogc(97) < disknocreate(90) < disknoflush(94) < disknocp(96) < 100
ERROR: avmaint: config: server_exception(MSG_ERR_INVALID_PARAMETERS)
root@vm-vdp:~/#: avmaint config disknocp=99 --ava
2013/12/05-13:55:41.90331 [avmaint] ERROR: <0949> Command failed because these config values do not meet the following criteria:
2013/12/05-13:55:41.90342 [avmaint] ERROR: <0001> disknocp(99) <= diskfulldelta(5 -> 96.5) < diskfull(30 -> 97.0) < poolnocreate(20 -> 98.0) < 100
ERROR: avmaint: config: server_exception(MSG_ERR_INVALID_PARAMETERS)
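The server enforces a strict ordering between these thresholds, which is why disknogc cannot simply be raised past disknocreate. Here is a stand-alone sketch of the first validation chain, reconstructed from the error message above with the values from the config dump; this script is illustrative only, not an appliance tool:

```shell
#!/bin/sh
# Ordering rule reconstructed from the avmaint error output:
# 0 < diskwarning < diskreadonly < disknogc < disknocreate < disknoflush < disknocp < 100
diskwarning=50; diskreadonly=65; disknocreate=90; disknoflush=94; disknocp=96

check_disknogc() {
    disknogc=$1
    if [ 0 -lt "$diskwarning" ] && [ "$diskwarning" -lt "$diskreadonly" ] \
       && [ "$diskreadonly" -lt "$disknogc" ] && [ "$disknogc" -lt "$disknocreate" ] \
       && [ "$disknocreate" -lt "$disknoflush" ] && [ "$disknoflush" -lt "$disknocp" ] \
       && [ "$disknocp" -lt 100 ]; then
        echo "disknogc=$disknogc OK"
    else
        echo "disknogc=$disknogc rejected"
    fi
}

check_disknogc 97   # rejected: 97 is not below disknocreate (90)
check_disknogc 85   # accepted: 65 < 85 < 90
```

So to raise disknogc you would first have to raise disknocreate, disknoflush and disknocp as well, and the second error shows disknocp is in turn capped by the diskfulldelta/diskfull chain.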
Hi,
Have you resolved your issue?
I'm asking because I have the same problem, 96% used capacity. Is the only solution to open a support case?
Regards,
Sebastian Ulatowski
I've deployed a new VDP and started new backup jobs. It was the simplest and fastest way for me. I didn't open a support case.
Hi,
I have the same problem with VDP 5.5.
Is there a solution to free up space?
I'm afraid it's available for VDPA only. Try to contact support if deploying a new VDP isn't possible for you.
FYI, this is still an issue in VDP 6.1.2.
My appliance hit 96.15% capacity. Backups failed. The 'Backup Scheduler' service will not start.
I have a case open with VMware. We have spent 2+ hours on a WebEx trying to get this thing back online. Finally escalated the case to EMC. Waiting on a resolution.
Here is a sad story with a happy ending about VDP 6.1.2.19. One unlucky day I fed a couple of OLAP VMs to VDP and it choked, failing to deduplicate properly. It ended up with the nodes full at 98%, 97% and 98% respectively.
I do not know what exactly helped, but here is a full list of my actions (add reboots as needed):
1. Deleted the big backups, ran a manual checkpoint, integrity check and garbage collection. Everything failed with MSG_DISK_FULL.
2. Rolled back to an earlier checkpoint, ran a manual checkpoint, integrity check and garbage collection. Everything failed with MSG_DISK_FULL.
3. Modified the configuration thresholds to allow garbage collection to run (as described above). Same errors when trying to set the values to 99%.
4. Expanded storage! The wizard completed successfully, but only node1 (/dev/sdc1) expanded. Ran a manual checkpoint, integrity check and garbage collection. Only the checkpoint succeeded; hfscheck and GC failed with MSG_DISK_FULL.
5. At this point I gave up and let the system run over the weekend.
6. On Monday it had magically repaired itself. There was a good checkpoint, a good hfscheck and a good GC. Admin mode persisted, though.
7. Rebooting several times and running manual checkpoints, including unmounting the disks and running xfs_check, helped at last. Fullaccess mode was back.
8. The last bit was using xfs_growfs on /dev/sdb1 and /dev/sdd1 to fix the wrong size of the remaining nodes.
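For reference, steps 2, 7 and 8 correspond roughly to the command sequence below. This is a hedged reconstruction, not verified documentation: the checkpoint tag is an example taken from the cplist output earlier in this thread, the /data01 mount point is an assumption about the appliance layout, and the exact syntax of rollback.dpn and avmaint should be checked against the appliance docs before use.

```shell
# Step 2 (assumed syntax): roll back to a validated checkpoint,
# using a tag as shown by cplist:
rollback.dpn --cptag=cp.20131202103632

# Steps 1/2 (assumed syntax): manual checkpoint, integrity check, GC:
avmaint checkpoint --ava
avmaint hfscheck --ava
avmaint garbagecollect --timeout=3600 --ava

# Step 7: with a data partition unmounted, check its filesystem
# (mount point /data01 is an assumption; adjust to your layout):
umount /data01
xfs_check /dev/sdb1
mount /data01

# Step 8: grow the filesystems that did not pick up the expanded size:
xfs_growfs /dev/sdb1
xfs_growfs /dev/sdd1
```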
Edit: I think the checkpoint rollback and waiting were enough...