Lead image: © Shariff Che'Lah, 123RF.com

Monitoring daemons with shell tools

Watching the Daemons

Administrators often write custom monitoring programs to make sure their daemons are providing the intended functionality. But simple shell tools are just as well suited to this task, and not just for systems that are low on resources. By Harald Zisler

Unix daemons typically go about their work discreetly in the background. The process table, which is output with the ps command, only shows you that these willing helpers have been launched, although in the worst case they could just be hanging around as zombies. Whether or not a daemon is actually working is not something that the process table will tell you. In other words, you need more granular diagnostics. The underlying idea is to write a "sensor" script for each service that performs a tangible check of its individual functionality.

Because almost every program returns a standardized exit code when it terminates, you can rely on Unix conventions: 0 stands for error-free processing, whereas a non-zero value (typically 1) indicates that problems were encountered. The shell stores this value in the $? variable, which a shell script evaluates immediately after launching the sensor.
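As a minimal sketch of this convention, the following wrapper runs any command as a sensor and evaluates $? right away. The helper name check_sensor and the two stand-in sensors are assumptions made for illustration.

```shell
#!/bin/sh
# check_sensor is a hypothetical helper: it runs the given command as a
# sensor, discards its output, and reports the result based on the exit code.
check_sensor() {
    name=$1
    shift
    "$@" >/dev/null 2>&1
    status=$?    # save at once; every later command overwrites $?
    if [ "$status" -eq 0 ]; then
        echo "$name: OK"
    else
        echo "$name: FAILED (exit code $status)"
    fi
    return "$status"
}

# Demonstration with two stand-in sensors, one succeeding and one failing.
check_sensor "good-sensor" true
check_sensor "bad-sensor" sh -c 'exit 3' || echo "a real script would react here"
```

Copying $? into a variable of its own matters because the test command in the if statement would otherwise overwrite it.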

Various programs are suitable for automated, "unmanned" access to the service provided by a given daemon; all of them will run in the shell without a GUI. These programs often provide an option (typically -q) that suppresses output, and this is fine for accessing the exit code. Error logs are obtainable by redirecting the error output to a file or, if available, by setting the corresponding program option.

The only thing left to do is to find the matching client program to test the functionality of each service.

Web Servers

To check a web server, you could use wget. The shell script command line for this would be:

wget --spider -q ip-address

The --spider option tells wget to check that the page exists but not to load it. Specifying the IP address instead of the hostname avoids a false alarm if DNS-based name resolution fails for some reason.
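Wrapped in a function, the wget check might look like the following sketch. The helper name check_web is hypothetical, and the timeout (-T 5) and retry limit (-t 1) are assumptions added so that an unresponsive server cannot stall the script for long.

```shell
#!/bin/sh
# check_web is a hypothetical helper. The timeout (-T) and retry (-t) values
# are assumptions, not settings taken from the article.
check_web() {
    if wget --spider -q -T 5 -t 1 "http://$1/"; then
        echo "web server $1: OK"
    else
        echo "web server $1: NOT RESPONDING"
        return 1
    fi
}

# Run the check only if an address was passed on the command line.
if [ $# -gt 0 ]; then
    check_web "$1"
fi
```

Saved as, say, webcheck.sh (an arbitrary name), it would be called with the IP address of the web server as its only argument.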

Almost all known databases include a client program for the shell – for example, mysql for MySQL or psql for PostgreSQL. Alternatively, you can use ODBC to access the database in your scripted monitoring, such as the isql tool provided by the Unix ODBC project.

For ease of access, you might need to set up a (non-privileged) user, a database, and a table for the test query on the database server. If you choose the ODBC option, you also need a .odbc.ini file with the right access credentials.

The exit codes of psql, the shell client for the PostgreSQL database, need special attention: 1 stands for an error in the query, although the connection attempt was successful; 2 indicates a connection error.

A connection test with psql would look like this:

psql -U User -d Database -c "select * from test_table;"

For ODBC access, you would need to pipe the SQL query to the client:

echo "select * from test_table;" | isql ODBC_data_source user
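The psql exit codes described above could be translated into a diagnosis with a case statement, as in this sketch. The helper name psql_diagnosis is hypothetical, and the message texts are only examples.

```shell
#!/bin/sh
# Translate a psql exit code into a short diagnosis, following the meanings
# given in the text (1 = query error, 2 = connection error).
# psql_diagnosis is a hypothetical helper name.
psql_diagnosis() {
    case $1 in
        0) echo "ok" ;;
        1) echo "query failed, server reachable" ;;
        2) echo "connection failed" ;;
        *) echo "unexpected exit code $1" ;;
    esac
}

# Typical use in a sensor script (commented out because it needs a database;
# User, Database, and test_table are the placeholders from the text):
#   psql -U User -d Database -c "select * from test_table;" >/dev/null 2>&1
#   psql_diagnosis $?
psql_diagnosis 2
```

Distinguishing the two error values lets a script restart the server only for connection errors and merely report a broken test query.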

For the CUPS printer daemon, lpq gives you a simple method of checking whether the daemon is alive. If you need to check access to individual printers, you additionally need to provide the print queue name and then evaluate the exit code of grep rather than that of lpq. To obtain an exit code that reflects the printer state, grep searches for the output that you receive when the printer is active:

lpq -Pprinter | grep -q "printer is ready"

You need to adapt the grep search string to match the output of lpq on your system.
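The lpq check can be packaged so that the queue name is a parameter, as in this sketch. The helper name printer_ready is hypothetical, and the search string assumes that lpq prints a line of the form "queue is ready".

```shell
#!/bin/sh
# printer_ready is a hypothetical helper. It succeeds only if lpq reports
# the queue as ready; the search string may differ on your system.
printer_ready() {
    lpq -P"$1" 2>/dev/null | grep -q "$1 is ready"
}

# Check the queue named on the command line, if any.
if [ $# -gt 0 ]; then
    if printer_ready "$1"; then
        echo "$1: ready"
    else
        echo "$1: not ready"
    fi
fi
```

The exit code of a pipeline is that of its last command, so the function returns whatever grep returns.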

The ping command checks network connections. The exit codes for errors differ depending on your operating system: If no reply is received, the FreeBSD ping returns 2, whereas the Linux ping returns 1.

The number of test packets is restricted by the -c packets option; this improves the script run time and avoids unnecessary network traffic. If you use the IP address as the target, you avoid the risk of false positives from buggy name resolution.

ping -c1 ip_address
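One way to sidestep the differing exit codes is to treat every non-zero value as a failure, as in this sketch. The helper name host_reachable is hypothetical; the addresses to probe come from the command line.

```shell
#!/bin/sh
# host_reachable is a hypothetical helper. It treats every non-zero exit
# code as a failure, which covers both the FreeBSD (2) and Linux (1) values.
host_reachable() {
    ping -c1 "$1" >/dev/null 2>&1
}

# Probe every address passed on the command line.
for addr in "$@"; do
    if host_reachable "$addr"; then
        echo "$addr: reachable"
    else
        echo "$addr: no reply"
    fi
done
```

Called without arguments, the loop body never runs and the script simply exits.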

Sensor scripts can obviously be extended to cover many other system parameters, such as disk space usage (df), logged in users (who), and much, much more.
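A disk space sensor, for instance, could compare the df output with a threshold. In this sketch, the helper names disk_usage and check_disk are hypothetical, and the default limit of 90 percent is an arbitrary assumption.

```shell
#!/bin/sh
# disk_usage prints the usage of the filesystem holding the given path as a
# bare percentage. df -P selects the fixed POSIX output format.
disk_usage() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# check_disk compares the usage with a limit; the 90 percent default is an
# arbitrary assumption.
check_disk() {
    limit=${2:-90}
    used=$(disk_usage "$1")
    if [ "$used" -ge "$limit" ]; then
        echo "$1: ${used}% used, limit ${limit}% reached"
        return 1
    fi
    echo "$1: ${used}% used"
}

check_disk / || echo "a real script would notify the administrator here"
```

The awk expression assumes that the device name contains no spaces, so that the percentage is always the fifth column.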

If an error occurs or a threshold value is exceeded, the script can use this information to generate a message and notify the system administrator. The message text should include the hostname, date, and time. Messages can be stored in a logfile to which the administrator has permanent access. To follow the messages as they arrive, you simply display the logfile in a terminal with tail -f, but other forms of communication, such as texting, are also possible.
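A message with hostname, date, and time could be written by a small logging function like the one in this sketch. The helper name log_alert, the default logfile name, and the date format are assumptions.

```shell
#!/bin/sh
# log_alert is a hypothetical helper that appends one line with hostname,
# date, and time to a logfile. The default filename is an assumption.
LOGFILE=${LOGFILE:-./monitor.log}

log_alert() {
    printf '%s %s: %s\n' "$(uname -n)" "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$LOGFILE"
}

log_alert "Database is not accessible!"
tail -n 1 "$LOGFILE"
```

Running tail -f on the same file in a second terminal then shows new messages as soon as they are written.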

If the shell script has the correct privileges, it can intervene and restart a daemon, remove lock files, or even reboot the whole system. Because you should avoid running this kind of script as root, you can instead set up special users and groups to own the script and the process (which is the case with many daemons).

Database Restart

The sample script in Listing 1 monitors an active database instance and notifies the administrator if the database happens to fail and then is successfully restarted (Figure 1). If it can't start the daemon, it waits for the administrator to step in and handle the situation.

Listing 1: Database Monitoring

#!/bin/sh

while true
do
    # Timestamp for the log entries of this pass
    time=$(date +%d.%m.%y\ %H:%M\ )

    psql -U monitor -d monitor -c "select * from watch;"

    # psql returns 2 if the connection to the server failed
    if [ $? -eq 2 ]
    then
        echo "$time: Database is not accessible! ****************" >> dba.log
        /usr/local/etc/rc.d/002pgsql.sh start
        sleep 15
        psql -U monitor -d monitor -c "select * from watch;"

        if [ $? -eq 0 ]
        then
            echo "$time: Database online!  +++++++++" >> dba.log
        else
            echo "$time: Database: serious error! ***************" >> dba.log
            echo "$time:         Unable to restart! ****************" >> dba.log

            # Keep polling until the administrator has fixed the problem
            while true
            do
                psql -U monitor -d monitor -c "select * from watch;"

                if [ $? -eq 0 ]
                then
                    time=$(date +%d.%m.%y\ %H:%M\ )
                    echo "$time: Database online! +++++++++" >> dba.log
                    break
                fi
                sleep 15
            done
        fi
    fi
    sleep 15
done
Figure 1: After starting, the script outputs the log at the console: availability, error, restart, database running.

Printer Restart

The second sample script relates to the printing service. It is taken from a production environment in which the cupsd server had an unknown problem with a network printer. The printer was disabled time and time again, causing no end of frustration to users and unnecessary work for the system admins. The shell script shown in Listing 2 doesn't output messages; instead, it simply re-enables the print queue. You can run these scripts either manually (for a temporary fix or quick check) or as RC scripts.

Listing 2: CUPS Monitoring

#!/bin/sh

while true
do
    # grep fails if the queue does not report that it is ready
    lpq -Plp | grep -q "lp is ready"

    if [ $? -gt 0 ]
    then
        cupsenable lp
    fi

    sleep 15
done

Conclusions

Administrators don't need a complex monitoring framework that covers every aspect of the environment and has a multi-week learning curve. With some scripting know-how, you can easily create your own shell scripts to monitor server daemon processes and restart them autonomously if so desired. The use of shell scripts to monitor daemons and other system functions is by no means restricted to small embedded systems. With scripts tailored to match your requirements, you can establish your own troubleshooting arsenal.