Management Passive Checks with Nagios
Lead Image © Lisa Young, 123RF.com
 

Configuring Nagios for passive script notification

Smooth Check

Why spam yourself with useless notifications every time a script completes successfully? You can use Nagios to screen the notices and just send you the ones that need action.

By Alessio Ciregia

Suppose you have developed a magnificent script (a Bash script, for example). This script will execute every night to dump a MySQL database or to rsync your valuable files to another server.

A common strategy is to use email to get notified about the script's results. If you wrote a good script, and your systems are robust, the script will never exit with an error condition, and every morning you will find a message in your inbox stating "OK: no error running my beautiful backup script."

This message is not a problem, as long as you only have one script sending notifications once a day. Suppose, however, that you have many scripts, with many backups, and all these scripts are notifying you via email. Your mailbox will fill up every morning with the same message. The first day, you will be impressed, but soon you will lose interest in reading all those identical messages. Your next step will be to create a filter in your mail client to put these messages in a subfolder, and after that you won't read them anymore, because you have basically just sent yourself a lot of spam. The worst part is that, in the flood of useless emails, you risk not seeing a real error notification that you really should be reading.

One solution is to send a message only in the case of an error; however, if you configure your script to send an email if it encounters an error during the execution, you have no guarantee that the script actually executed. If the script crashes or aborts before the line that sends the email notification, you will never know.

A far better solution is to let Nagios [1] listen for the notifications and only notify you if an error occurs or if a message indicating success is not received.

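Nagios will help you eliminate spam messages, reduce the risk of losing important notifications, and make sure your scripts are really executed.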

The Nagios passive check technique described in this article uses Nagios Service Check Acceptor (NSCA) [2]. This article assumes you have a working knowledge of Nagios. If you are new to the Nagios network monitoring system, see the resources at the Nagios website.

Getting Started

I'll start by defining a host (Listing 1). This host can be the server where your script runs, the server involved in the backup, or a dummy host (in Nagios, you can define a host and associate a check_command with that host).

Listing 1: Host Definition

define host {
     host_name                  yourserver
     alias                      yourserver
     address                    192.168.0.100
     check_command              check-host-alive
     contact_groups             contacts
     use                        check_5min_24x7,notify_24h_24x7
}

This article assumes you have some background with Nagios configuration, but I'll start with a little refresher for those who haven't tried it in a while. Nagios lets you define the objects you want to monitor, like hosts and services. Instead of issuing repetitive directives for each object, you can set up templates to use in common situations.

As you can see in Listing 1, I use the templates check_5min_24x7 and notify_24h_24x7. The check_5min_24x7 template (Listing 2) checks whether the host is alive (the check-host-alive command uses a ping) every five minutes, for the period defined in the check_period attribute (in this case, 24x7). The 24x7 time period template (Listing 3) tells Nagios to perform the related action (a check or a notification) every day of the week at any hour. If you do not want to check that a host is alive, you can define a check template that always returns OK (see the Nagios check_dummy plugin).

Listing 2: check_5min_24x7

define host {
        name                            check_5min_24x7
        register                        0
        max_check_attempts              3
        check_interval                  5
        retry_interval                  1
        active_checks_enabled           1
        passive_checks_enabled          1
        check_freshness                 1
        freshness_threshold             1800
        check_period                    24x7
        check_command                   check-host-alive
}

Listing 3: 24x7

define timeperiod {
        timeperiod_name                24x7
        alias                          24x7
        sunday                         00:00-24:00
        monday                         00:00-24:00
        tuesday                        00:00-24:00
        wednesday                      00:00-24:00
        thursday                       00:00-24:00
        friday                         00:00-24:00
        saturday                       00:00-24:00
}

The notify_24h_24x7 template (Listing 4) defines notification behavior, that is, how Nagios reacts to a state change (e.g., the script exits with an error, and the service state changes from 0 OK to 2 CRITICAL). Because the host in Listing 1 uses this template, it is defined as a host template here; a service template of the same name follows in Listing 6. This configuration tells Nagios to send a notification on any day of the week, at any hour of the day (as seen for the 24x7 time period in Listing 3); then, if no further state change occurs, to wait 24 hours before sending another notification (in this case, another email). The notification_interval is expressed in time units, which are minutes with the default interval_length of 60 seconds, so 24 hours corresponds to a value of 1440.

Listing 4: notify_24h_24x7

define host {
        name                    notify_24h_24x7
        register                0
        notification_interval   1440
        notification_options    d,u,r,f,s
        notification_period     24x7
}

The next step is to define a service template (Listing 5). This template will use the freshness_threshold option to raise an alert if Nagios does not receive any notification from your script over a period of 93600 seconds (26 hours). Suppose your script is executed by cron every day at 1:00am, that is, every 24 hours: The 26-hour threshold gives the script two hours to complete. (Obviously, you must adjust this period for your own situation.)

Listing 5: Service Template

define service {
       name                    check_passive_26h_24x7
       register                0
       max_check_attempts      1
       check_interval          1
       retry_interval          1
       active_checks_enabled   0
       passive_checks_enabled  1
       notifications_enabled   1
       check_freshness         1
       freshness_threshold     93600
       check_period            24x7
}
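The arithmetic behind the 93600 value can be checked with a short shell sketch. The function name freshness_threshold and its two arguments are hypothetical helpers for illustration, not part of Nagios; the only assumption carried over from the text is that the threshold is the interval between runs plus a grace period, converted to seconds.

```shell
#!/bin/sh
# Hypothetical helper: compute a freshness_threshold in seconds from the
# interval between script runs and a grace period, both given in hours.
freshness_threshold() {
    run_interval_hours=$1
    grace_hours=$2
    echo $(( (run_interval_hours + grace_hours) * 3600 ))
}

# A daily cron job with two hours of slack, as in Listing 5:
freshness_threshold 24 2    # prints 93600
```

For a script that runs every 12 hours and needs at most one hour to finish, the same call with 12 and 1 would give the value to use instead.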

The next step is to define a service template related to the notification (Listing 6). This template defines how often to send notifications in case of warning or critical status: Once a day is sufficient.

Listing 6: Notification Template

define service {
        name                    notify_24h_24x7
        register                0
        notification_interval   1440
        notification_options    w,u,c,r,f,s
        notification_period     24x7
}

Now define a service related to your script (Listing 7).

Listing 7: Service Template for the Script

define service {
   service_description           Powerful_backup
   check_command                 passive_backup!2!"Warning: no passive check received in the expected period"
   host_name                     yourserver
   contact_groups                contacts
   flap_detection_enabled        0
   event_handler_enabled         0
   use                           check_passive_26h_24x7,notify_24h_24x7
}

The command in Listing 8 points to a script in the $USER1$ directory (in the Debian package, this directory is /usr/lib/nagios/plugins/).

Listing 8: Command Template for Passive Check

define command {
        command_name    passive_backup
        command_line    $USER1$/nobackupreport.sh $ARG1$ $ARG2$
}

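The script in Listing 9 simply prints the string you pass to it and exits with the exit status you pass to it. Nagios runs it only when the freshness threshold expires without a passive result; with the arguments from Listing 7, the script prints the warning that no passive check was received and exits with 2, the CRITICAL exit code for Nagios plugins.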

Listing 9: nobackupreport.sh

#!/bin/sh

# First argument: the exit status to return to Nagios
status=$1
shift 1

# Remaining arguments: the message shown as plugin output
/bin/echo "$@"

exit "$status"
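To see what Nagios gets back from this helper, you can run it by hand. The following sketch writes a throwaway copy that mirrors Listing 9 into a temporary directory (the temporary path is an assumption made only so the example is self-contained) and calls it with the arguments from Listing 7.

```shell
#!/bin/sh
# Write a throwaway copy of the helper (mirroring Listing 9) to a temp dir.
tmpdir=$(mktemp -d)
cat > "$tmpdir/nobackupreport.sh" <<'EOF'
#!/bin/sh
status=$1
shift 1
/bin/echo "$@"
exit "$status"
EOF

# Call it the way Nagios would for a stale service: state 2 plus a message.
if output=$(sh "$tmpdir/nobackupreport.sh" 2 \
    "Warning: no passive check received in the expected period"); then
    code=0
else
    code=$?
fi

echo "plugin output: $output"
echo "exit status:   $code"

rm -rf "$tmpdir"
```

The exit status is what Nagios maps to a service state, and the printed text is what appears in the web interface and in the notification.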

Now you have to install and start the NSCA service. In Debian, that's:

apt-get install nsca
/etc/init.d/nsca start

For simplicity, I use the default /etc/nsca.cfg configuration file, in which you can define the decryption method (just obfuscation by default) and the listening port (5667 by default). This service will listen for incoming passive checks.

Now, on the machine where your powerful backup script resides, you must install the nsca-client package (check your distribution or operating system for the exact package name) and edit your script to add the section related to Nagios. To submit a result, pipe a string in the following form to the NSCA client command, send_nsca:

"host;service;state;message"

where host is the hostname, service is the service name previously defined in the Nagios configuration, state is a Nagios status code (0 OK, 1 warning, 2 critical), and message is the message that will appear in the notification (on the Nagios web page as well as in the email message).
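Building this string in one place helps avoid typos in the field order. The following is a minimal sketch; the function name nsca_line is hypothetical, and it additionally accepts state 3, which Nagios uses for UNKNOWN.

```shell
#!/bin/sh
# Hypothetical helper: print one "host;service;state;message" line.
# Arguments: host, service, state, message.
# Returns 1 without printing a line if the state is not a Nagios status code.
nsca_line() {
    case $3 in
        0|1|2|3) ;;    # OK, WARNING, CRITICAL, UNKNOWN
        *) echo "invalid state: $3" >&2; return 1 ;;
    esac
    printf '%s;%s;%s;%s\n' "$1" "$2" "$3" "$4"
}

nsca_line yourserver Powerful_backup 0 "Backup OK"
# prints: yourserver;Powerful_backup;0;Backup OK
```

Because the semicolon is the field delimiter, the message itself should not contain one.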

Listing 10 is a Bash script showing the passive check.

Listing 10: Passive Check Bash Script

#!/bin/bash

# Placeholder: replace the next line with your real backup command
mysqldump and so on

# Save the exit status of the backup command
EL=$?

if [ "$EL" -ne 0 ]
then
  MESSAGE="Problem with mysqldump"
  STATE=2    # CRITICAL
else
  MESSAGE="Backup OK"
  STATE=0    # OK
fi

# Submit the result to the NSCA daemon on the Nagios server
echo "yourserver;Powerful_backup;$STATE;$MESSAGE" | /usr/sbin/send_nsca \
   -H yournagiosserverIP -p 5667 -d ";" -c /etc/send_nsca.cfg
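If you have several scripts to report on, the same pattern can be wrapped in a small function instead of being repeated in each one. The sketch below is one possible arrangement, not part of NSCA: the names send_result and report_run are hypothetical, and the send_nsca options are simply those from Listing 10.

```shell
#!/bin/sh
# Deliver one result line read from standard input.
# Redefine this function (e.g., as "cat") to try the wrapper without a
# Nagios server.
send_result() {
    /usr/sbin/send_nsca -H yournagiosserverIP -p 5667 -d ";" -c /etc/send_nsca.cfg
}

# report_run HOST SERVICE COMMAND [ARGS...]
# Runs COMMAND and submits state 0 on success or state 2 on failure.
report_run() {
    host=$1
    service=$2
    shift 2
    if "$@"; then
        state=0
        message="$1 OK"
    else
        rc=$?
        state=2
        message="Problem with $1 (exit status $rc)"
    fi
    echo "$host;$service;$state;$message" | send_result
}

# Example call (commented out because it needs a real command and server):
# report_run yourserver Powerful_backup /usr/local/bin/my_backup.sh
```

The path /usr/local/bin/my_backup.sh in the example call is a placeholder for your own script.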

Conclusion

You can tweak this passive check configuration for your own needs. This technique offers a useful way to handle notifications and checks coming from your scripts. You can eliminate spam messages, reduce the risk of losing important notifications, and make sure your scripts are really executed.

The NSCA client is packaged for all Linux distributions, Solaris, and other Unix variants; you also can use it in a SmartOS global zone without having to install a package. The send_nsca utility is also available for Windows. You can use the JSend NSCA Java API to send Nagios passive checks from within your Java applications, and APIs also exist for other languages, such as Ruby, PHP, and Perl.

If you don't want to take the time to learn the details of Nagios manual configuration, you might want to experiment with Nagios web GUI configuration tools, such as NConf [3].