Hi,
Simple question: What are the best practices and tools for monitoring VDR 2.0? As there are no alarms in vCenter, I would like to integrate a check into Nagios so that I get notified when something goes wrong with the VDR backups.
I've found these solutions:
http://www.jules.fm/Logbook/files/vdr-log-analyser.html
https://www.monitoringexchange.org/inventory/Check-Plugins/Software/Backup/VMWare-Data-Recovery
http://exchange.nagios.org/directory/Plugins/Backup-and-Recovery/VMware-Data-Recovery/details
However, none of them is really straightforward. What are your best practices, and which tools do you use to monitor VDR?
Hi,
I've decided to slightly modify the script found here: https://www.monitoringexchange.org/inventory/Check-Plugins/Software/Backup/VMWare-Data-Recovery. I've extracted the Python code out of the shell script and created a standalone Python Nagios plugin (you can find it in the attachment). The plugin parses the VDR logs and reports any errors. I tested it on 4 different VDR appliances (some of them had jobs with errors) and the error reporting was accurate.
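For readers who just want to see the general idea of such a check, here is a minimal sketch. It is not the attached plugin. It assumes that a problem shows up as a log line that contains the task name together with a word like "error" or "failed"; the real VDR log lines may look different. The names check_task and ERROR_PATTERN are made up for this illustration, and the sketch is written for Python 3, which may be newer than the Python on the appliance. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin convention.

```python
import re
import sys

# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumption: a failure appears as a log line that names the task and
# contains one of these words. Real VDR log lines may look different.
ERROR_PATTERN = re.compile(r"\b(error|failed|failure)\b", re.IGNORECASE)


def check_task(lines, task_name):
    """Return (exit_code, message) for one task name, e.g. "Integrity Check"."""
    seen = 0
    errors = []
    for line in lines:
        # Only look at lines that mention the task we were asked about.
        if task_name.lower() not in line.lower():
            continue
        seen += 1
        if ERROR_PATTERN.search(line):
            errors.append(line.strip())
    if errors:
        return CRITICAL, "CRITICAL: %d error line(s) for '%s'; last: %s" % (
            len(errors), task_name, errors[-1])
    if seen == 0:
        return UNKNOWN, "UNKNOWN: no log entries found for '%s'" % task_name
    return OK, "OK: %d log line(s) for '%s', no errors" % (seen, task_name)


def main(argv, log_path="/var/log/messages"):
    """Print one status line and return the Nagios exit code."""
    if len(argv) != 2:
        print("UNKNOWN: usage: %s <task name>" % argv[0])
        return UNKNOWN
    try:
        with open(log_path, errors="replace") as handle:
            code, message = check_task(handle, argv[1])
    except OSError as exc:
        code, message = UNKNOWN, "UNKNOWN: cannot read %s: %s" % (log_path, exc)
    print(message)
    return code

# To run this as a plugin, finish the script with: sys.exit(main(sys.argv))
```

Nagios only looks at the exit code and the first line of output, so the sketch keeps the message on a single line and returns UNKNOWN when the log cannot be read at all.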
Here are short install instructions (you'll have to hack the VDR appliance a little bit, but the hacks are independent of the appliance software itself, so they should survive future appliance updates and upgrades):
I.) VMware Data Recovery appliance configuration:
1. install basic Nagios NRPE service
rpm -ivh http://download.fedora.redhat.com/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm
yum -y install nrpe
yum -y install nagios-common nagios-plugins
2. configure the service to start at boot and start it
chkconfig nrpe on
service nrpe start
3. open firewall for NRPE service
vi /etc/sysconfig/iptables
add this rule just after the "--dport 22" rule:
-A VDR-Firewall-1-INPUT -p tcp -m tcp --dport 5666 -j ACCEPT
service iptables restart
4. put check_vmware_data_recovery.py (plugin is in the attachment) into /usr/lib64/nagios/plugins and set permissions
chmod +x /usr/lib64/nagios/plugins/check_vmware_data_recovery.py
5. test the plugin
/usr/lib64/nagios/plugins/check_vmware_data_recovery.py "Integrity Check"
6. make sure nrpe user is able to read /var/log/messages
6.1: this is quite ugly, but easy (make the log world-readable; the execute bit is not needed):
# chmod 0644 /var/log/messages
or
6.2: allow nrpe to run the plugins as root via sudo:
# visudo
comment out the line "Defaults requiretty" and add this line at the bottom:
nrpe ALL=(ALL) NOPASSWD: /usr/lib64/nagios/plugins/
7. add the command to the NRPE configuration (the /usr/bin/sudo prefix is only needed if you chose option 6.2)
vi /etc/nagios/nrpe.cfg
command[check_vmware_data_recovery]=/usr/bin/sudo /usr/lib64/nagios/plugins/check_vmware_data_recovery.py "Integrity Check"
8. reload NRPE service
service nrpe reload
II.) Nagios server configuration:
1. define a new command on your Nagios server
define command{
command_name check_vmware_data_recovery
command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_vmware_data_recovery
}
2. add a service check for your VMware Data Recovery appliance:
define service{
use generic-service
host_name vdr-01
service_description vmware data recovery
check_command check_vmware_data_recovery
}
3. manually test the NRPE command from the Nagios server
/usr/lib/nagios/plugins/check_nrpe -H [your VDR appliance IP] -c check_vmware_data_recovery
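If you want to script that manual test, the following is a small sketch of how the result can be interpreted. It just runs a check command and maps the exit code to the usual Nagios status names (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name run_check and the 30 second timeout are my own choices for the illustration, and the IP address in the comment is only a placeholder.

```python
import subprocess
import sys

# Standard Nagios plugin exit codes and their status names.
STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_check(command, timeout=30):
    """Run a Nagios check command and return (status_name, first_output_line)."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Command missing, not executable, or it hung: treat as UNKNOWN.
        return "UNKNOWN", str(exc)
    # Any exit code outside 0..3 is also reported as UNKNOWN.
    status = STATUS_NAMES.get(result.returncode, "UNKNOWN")
    lines = result.stdout.strip().splitlines()
    return status, lines[0] if lines else ""

# Example call (placeholder IP address, adjust to your VDR appliance):
# run_check(["/usr/lib/nagios/plugins/check_nrpe",
#            "-H", "192.0.2.10", "-c", "check_vmware_data_recovery"])
```

An UNKNOWN result here usually means the problem is in the NRPE path (firewall, nrpe.cfg, sudo) rather than in the backups themselves, so it is worth checking steps 3, 6 and 7 on the appliance first.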
---
That's it ... if you have Nagios in your environment, this should be an easy, straightforward and good enough tool for monitoring the health of your VMware Data Recovery appliances.
---
EDIT@2012-12-03: I've uploaded a new version of the plugin