Monitoring daemons with shell tools
Watching the Daemons
Unix daemons typically go about their work discreetly in the background. The process table, which is output with the ps command, only shows you that these willing helpers have been launched, although in the worst case they could just be hanging around as zombies. Whether or not a daemon is actually working is not something that the process table will tell you. In other words, you need more granular diagnostics. The underlying idea is to write a "sensor" script for each service that performs a tangible check of its individual functionality.
Because almost every program returns a standardized exit code when it terminates, you can rely on Unix conventions: 0 stands for error-free processing, whereas a non-zero value (typically 1) indicates that problems were encountered. This value is stored in the $? shell variable, which a shell script evaluates immediately after the sensor has run.
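The exit code convention can be sketched roughly as follows. The function name check_sensor and its output format are illustrative assumptions, not part of any standard tool; the sketch only shows how $? is captured before another command can overwrite it.

```shell
#!/bin/sh
# check_sensor NAME COMMAND [ARGS...]
# Hypothetical helper: runs a sensor command, stores its exit code,
# and prints a one-line verdict.
check_sensor() {
    name=$1
    shift
    status=0
    # Capture $? right away; any later command would overwrite it.
    "$@" >/dev/null 2>&1 || status=$?
    if [ "$status" -eq 0 ]; then
        echo "$name: OK"
    else
        echo "$name: FAILED (exit code $status)"
    fi
    return "$status"
}
```

A call such as check_sensor web wget --spider -q ip-address would then report the state of a single service and pass the sensor's exit code on to the caller.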
Various programs are suitable for automated, "unmanned" access to the service provided by a given daemon; all of them will run in the shell without a GUI. These programs often provide an option (typically -q) that suppresses output, and this is fine for accessing the exit code. Error logs are obtainable by redirecting the error output to a file or, if available, by setting the corresponding program option.
The only thing left to do is to find the matching client program to test the functionality of each service.
Web Servers
To check a web server, you could use wget. The shell script command line for this would be:
wget --spider -q ip-address
The --spider option tells wget to check that the page exists but not to download it. Specifying the IP address instead of the hostname avoids a false alarm if DNS-based name resolution fails for some reason.
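Wrapped in a function, the web server check might look like the sketch below. The function names check_web and report_web are hypothetical, and the --tries and --timeout limits are added assumptions that keep an unresponsive server from stalling the script; adjust or drop them as needed.

```shell
#!/bin/sh
# check_web TARGET
# Returns wget's exit code; 0 means the page was found.
check_web() {
    # --spider: check only, no download; -q: no output.
    # --tries=1 and --timeout=10 are assumed limits, not from the article.
    wget --spider -q --tries=1 --timeout=10 "$1"
}

# report_web TARGET
# Prints a short verdict based on the exit code of check_web.
report_web() {
    if check_web "$1"; then
        echo "web server OK"
    else
        echo "web server FAILED (wget exit code $?)"
    fi
}
```

Calling report_web with the IP address of the server prints a single status line that a surrounding monitoring loop can log.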
Almost all known databases include a client program for the shell, for example, mysql for MySQL or psql for PostgreSQL. Alternatively, you can use ODBC to access the database in your scripted monitoring, for example with the isql tool provided by the unixODBC project.
For ease of access, you might need to set up a (non-privileged) user, a database, and a table for the test query on the database server. If you choose the ODBC option, you also need a .odbc.ini file with the right access credentials.
The psql shell client for the PostgreSQL database also poses the problem of non-standard exit codes: 1 stands for an error in the query, although the connection attempt was successful; 2 indicates a connection error.
A connection test with psql would look like this:
psql -U User -d Database -c "select * from test_table;"
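Because psql distinguishes between query errors and connection errors, a sensor can map the exit code to a status word. The following is a minimal sketch; the function name check_db and the status words are hypothetical, and User, Database, and test_table are the placeholders from the command above.

```shell
#!/bin/sh
# check_db
# Runs the test query and translates psql's exit code into a status word.
check_db() {
    rc=0
    psql -U User -d Database -c "select * from test_table;" \
        >/dev/null 2>&1 || rc=$?
    case $rc in
        0) echo "ok" ;;
        1) echo "query-error" ;;       # server reachable, query failed
        2) echo "connection-error" ;;  # server could not be reached
        *) echo "unknown" ;;           # any other exit code
    esac
}
```

A monitoring script could then react differently to the two failure types, for example by attempting a restart only on connection-error.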
For ODBC access, you would need to pipe the SQL query to the client:
echo "select * from test_table;" | isql ODBC_data_source user
For the CUPS printer daemon, lpq gives you a simple method of checking whether the daemon is alive. If you need to check access to individual printers, you additionally need to provide the print queue name and have grep evaluate the output. To obtain an exit code that follows the usual convention, grep checks for the text that lpq prints when the printer is active:
lpq -Pprinter | grep -q "printer is ready"
You need to adapt the grep search string to match the output of lpq on your system.
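Since the search string varies between systems, it can help to make it a parameter. This sketch assumes a hypothetical function name, check_printer, and uses "QUEUE is ready" as the default pattern; verify the actual wording of your lpq output before relying on it.

```shell
#!/bin/sh
# check_printer QUEUE [PATTERN]
# Succeeds if the lpq output for QUEUE contains PATTERN.
check_printer() {
    queue=$1
    # Default pattern is an assumption; adapt it to your lpq output.
    pattern=${2:-"$queue is ready"}
    # The exit code of the pipeline is the exit code of grep.
    lpq -P"$queue" 2>/dev/null | grep -q "$pattern"
}
```

For example, check_printer lp tests the default queue name used later in Listing 2, and a second argument overrides the search string.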
The ping command checks network connections. The exit code it returns when the target does not respond depends on your operating system: The FreeBSD ping uses 2, whereas the Linux ping uses 1.
The -c packets option restricts the number of test packets; this shortens the script run time and avoids unnecessary network traffic. If you use the IP address as the target, you avoid the risk of false alarms caused by faulty name resolution.
ping -c1 ip_address
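One way to sidestep the differing error codes is to treat every non-zero exit code as a failure. The function names check_host and report_host below are hypothetical; the sketch only relies on the -c option shown above.

```shell
#!/bin/sh
# check_host TARGET
# Succeeds only if ping exits with 0; any non-zero code counts as failure,
# so the FreeBSD and Linux values need no special handling.
check_host() {
    ping -c1 "$1" >/dev/null 2>&1
}

# report_host TARGET
# Prints a one-line verdict for the given target.
report_host() {
    if check_host "$1"; then
        echo "$1: reachable"
    else
        echo "$1: unreachable"
    fi
}
```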
Sensor scripts can obviously be extended to cover many other system parameters, such as disk space usage (df), logged in users (who), and much, much more.
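As an illustration of such a threshold sensor, the following sketch reads the usage percentage of a filesystem from df and compares it with a limit. The function names disk_usage and check_disk are hypothetical, and the sketch assumes the POSIX output format requested with df -P, in which the fifth column of the second line holds the capacity.

```shell
#!/bin/sh
# disk_usage MOUNTPOINT
# Prints the usage of the filesystem as a plain number (percent).
disk_usage() {
    # -P requests the portable one-line-per-filesystem format.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# check_disk MOUNTPOINT LIMIT
# Succeeds while the usage stays below LIMIT percent.
check_disk() {
    usage=$(disk_usage "$1")
    [ "$usage" -lt "$2" ]
}
```

A call like check_disk / 90 then behaves like any other sensor: exit code 0 while usage is below 90 percent, non-zero otherwise.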
If an error occurs or a threshold is exceeded, the script can use this information to generate a message and notify the system administrator. The message text should include the hostname, date, and time. Messages can be stored in a file to which the administrator has permanent access; to follow them, simply display the logfile in a terminal with tail -f. Other forms of communication are also possible, such as text messages.
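A message helper along these lines could look like the following sketch. The function name notify, the LOGFILE variable, and the default file name sensor.log are assumptions for illustration; uname -n is used here to obtain the hostname.

```shell
#!/bin/sh
# Log file used by notify; sensor.log is an assumed default name.
LOGFILE=${LOGFILE:-sensor.log}

# notify MESSAGE...
# Appends one line with hostname, date, and time to the log file.
notify() {
    echo "$(uname -n) $(date '+%d.%m.%y %H:%M'): $*" >> "$LOGFILE"
}
```

After a call such as notify "Database is not accessible", the administrator can follow new entries with tail -f on the same file.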
If the shell script has the correct privileges, it can intervene and restart a daemon, remove lock files, or even reboot the whole system. Because you should avoid running this kind of script as root, you can instead set up special users and groups to own both the script and the process (as is already the case with many daemons).
Database Restart
The sample script in Listing 1 monitors an active database instance and notifies the administrator if the database happens to fail and then is successfully restarted (Figure 1). If it can't start the daemon, it waits for the administrator to step in and handle the situation.
Listing 1: Database Monitoring
#!/bin/sh
# Poll the database every 15 seconds and try one restart if it is unreachable.

while true
do
    time=$(date +%d.%m.%y\ %H:%M\ )

    psql -U monitor -d monitor -c "select * from watch;"

    # psql exit code 2 indicates a connection error
    if [ $? -eq 2 ]
    then
        echo "$time: Database is not accessible! ****************" >> dba.log
        /usr/local/etc/rc.d/002pgsql.sh start
        sleep 15
        psql -U monitor -d monitor -c "select * from watch;"

        if [ $? -eq 0 ]
        then
            echo "$time: Database online! +++++++++" >> dba.log
        else
            echo "$time: Database: serious error! ***************" >> dba.log
            echo "$time: Unable to restart! ****************" >> dba.log

            # Keep checking until the administrator has fixed the problem
            while true
            do
                psql -U monitor -d monitor -c "select * from watch;"

                if [ $? -eq 0 ]
                then
                    time=$(date +%d.%m.%y\ %H:%M\ )
                    echo "$time: Database online! +++++++++" >> dba.log
                    break
                fi
                sleep 15
            done
        fi
    fi
    sleep 15
done
Printer Restart
The second sample script relates to the printing service. The one shown here is taken from a production example, in which the cupsd server has an unknown problem with a network printer. The printer was disabled time and time again, causing no end of frustration to users and unnecessary work for the system admins. The shell script shown in Listing 2 doesn't output messages; instead, it simply re-enables the printer with cupsenable. Run these scripts either manually (for a temporary fix or quick check) or as RC scripts.
Listing 2: CUPS Monitoring
#!/bin/sh

while true
do
    # grep returns non-zero if lpq does not report the queue as ready
    lpq -Plp | grep -q "lp is ready"

    if [ $? -gt 0 ]
    then
        cupsenable lp
    fi

    sleep 15
done
Conclusions
Administrators don't need a complex monitoring framework that covers every aspect of the environment and has a multi-week learning curve. With some scripting know-how, you can easily create your own shell scripts to monitor server daemon processes and restart them autonomously if so desired. The use of shell scripts to monitor daemons and other system functions is by no means restricted to small embedded systems. With scripts tailored to match your requirements, you can establish your own troubleshooting arsenal.