Management Writing Your Own OCF Agent Lead image: © Aleksey Mnogosmyslov, 123RF.com

Monitoring your cluster with a home-grown OCF agent

Personal Agent

Admins who want to leverage the powers of Pacemaker rely on OCF resource agents to monitor the cluster. If you don't have an agent for a specific application, try writing your own. By Martin Loschwitz

The Linux cluster stack includes multiple components – Corosync or Heartbeat handles cluster communication, Pacemaker takes care of the cluster services (known as "resources"), and the storage component is most often DRBD. The resource layer, that is, Pacemaker, contains multiple components that mesh to control the cluster in the best possible way. The unspectacular agent layer in the cluster stack is in fact extremely important for the functionality of the entire cluster.

The cluster can't perform its task perfectly if the resource agents are not working well, and this means that high-quality resource agents are very important. If you don't have a good agent for a specific program you are running on the cluster, it could be worth your while to program your own resource agent.

This article describes how to create a resource agent that is compatible with the Open Cluster Framework (OCF) standards. The stated goal of OCF is to "define standards for clustering APIs." Creating an agent that complies with the OCF standards will maximize the portability of your agent and minimize the late-night troubleshooting for problems on your cluster (see the box titled "More on Agents").

OCF Benefits

What benefits does the OCF standard offer? One important benefit is that administrators have the ability to define the configuration parameters for a resource directly in the Pacemaker CRM configuration. Pacemaker passes the parameters through to the resource agents, which then convert the information into commands. A classic example of this is the resource agent that handles the cluster IP addresses, that is ocf:heartbeat:IPaddr2. This agent is always assigned the ip parameter via the CRM configuration and thus knows which IP is the cluster IP. The filesystem agent works in a similar way; it takes all of the information it needs to mount a filesystem from the CRM configuration. Configuration parameters only work for OCF agents; LSB scripts and Heartbeat agents don't support this feature. You can add parameter lines for them in the CRM shell, but Pacemaker will totally ignore the parameters.

The second major benefit of OCF agents is that they work independently of the distribution on which they are deployed. This means that you can easily migrate a cluster configuration from a Debian system to a SLES system without modifying the configuration. You can even copy the resource agent to another computer as a file and deploy it there. As long as the agent complies with the OCF standard, it will work wherever you want it to, without any hitches. This interoperability also extends to non-Linux systems; theoretically, the cluster stack should also work on the typical BSD derivatives.

Finally, the OCF standard also makes detailed statements about the values an agent has to return to the lrmd. These detailed statements give the cluster administrator a quick overview of the problems if something goes wrong in the cluster.

More on Agents

Pacemaker itself is the crmd, the resource manager for the entire cluster that ensures all of the configured services launch like the doctor ordered. Pacemaker runs on every system, as does lrmd, the Local Resource Management Daemon. lrmd comes courtesy of the cluster-glue package, which you need to install as a prerequisite for Pacemaker. But lrmd itself doesn't launch any programs – this is where the resource agents enter the game; lrmd calls the resource agent when it needs to start a service on a cluster node.

Resource agents typically come from three different categories or "classes": LSB agents are init scripts and reside in /etc/init.d. Heartbeat agents are designed for the now obsolete Heartbeat 1 and are regarded as deprecated. The third and "highest" class is OCF resource agents. The box titled "OCF Benefits" describes some of the advantages achieved through OCF.

Managing your cluster applications through non-standards-compliant scripts, which is theoretically possible using an LSB class in Pacemaker, can quickly cause problems on the cluster. (If you've ever had to repair a Samba cluster in the middle of the night because the Samba init script provided by Debian doesn't support the monitor target correctly, you will know exactly what I mean.) Maintaining OCF compliance can save you a huge amount of trouble.

Preparations

Before you start writing the first line of code [1], it is a good idea to sort out a few details. You should always have the OCF standard on hand; it is available on the web [2]. You will also find it useful to have the current version 1.0.2 [3] of the OCF Resource Agent Developer's Guide, created by Hastexo's Florian Haas. The Developer's Guide gives you an overview of the details that will require your attention. Resource agent authors should also be aware of the parameters that their agent will need to support.

In this article, I will show how to create an OCF resource agent for the Asterisk telephony solution. I was part of a team of developers that created an Asterisk agent just a few weeks ago; the agent applies the approaches described in this article.

The default files in Debian, which typically reside in /etc/default/ and are named after the program for which they provide default settings, give you a good idea of the options that are worth adding to the resource agent (Figure 1).

Figure 1: In Debian systems, the default files provide information on the parameters that a resource agent should support.

In Asterisk's case, this information includes a specification of the user account and the privileges necessary for running Asterisk. You will also find a ulimit value that defines the number of files that Asterisk is allowed to open at any given time. Debian references these settings in the Asterisk init script, and they can be implemented as parameters in an OCF agent.

Another good reference is the --help command-line option. The list of parameters that the binary itself supports gives the OCF agent author an overview of the supported configuration values.

In this example, the resource agent for Asterisk should support the following parameters: binary specifies where the Asterisk binary resides. canary_binary defines the location of the astcanary binary, and config shows the location of asterisk.conf. additional_parameters lets you manually define more parameters. realtime defines whether or not Asterisk will run in real-time mode. Finally, maxfiles sets the maximum number of files that the telephone system is allowed to open at any given time.

Resource Agent Headers

The first line in the OCF resource agent is the same as for any other shell script – the shebang. This line is followed by a generic description for the resource agent and a copyright notice; a list of parameters supported by the resource agent should follow to provide an at-a-glance overview. The header doesn't follow any specific syntax, so the header for the Asterisk agent could look similar to Listing 1.

Listing 1: Resource Agent Header

#!/bin/sh
#
#
# Asterisk
#
# Description:  Manages an Asterisk PBX as an HA resource
#
# Authors:      Martin Gerhard Loschwitz
#               Florian Haas
#
# Support:      linux-ha@lists.linux-ha.org
# License:      GNU General Public License (GPL)
#
# (c) 2011      hastexo Professional Services GmbH
#
# This resource agent is loosely derived from the MySQL resource
# agent, which itself is made available to the public under the
# following copyright:
#
# (c) 2002-2005 International Business Machines, Inc.
#     2005-2010 Linux-HA contributors
#
# See usage() function below for more details ...
#
# OCF instance parameters:
#   OCF_RESKEY_binary
#   OCF_RESKEY_canary_binary
#   OCF_RESKEY_config
#   OCF_RESKEY_user
#   OCF_RESKEY_group
#   OCF_RESKEY_additional_parameters
#   OCF_RESKEY_realtime
#   OCF_RESKEY_maxfiles
#   OCF_RESKEY_monitor_sipuri
###############################################################

The next step is to initialize all of the OCF-specific functions:

: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

The OCF_ROOT variable, which is referenced at this point, has already been set in the Pacemaker environment the script will run on. From now on, the agent has access to a full set of OCF functions.

Default Values for Configuration Parameters

Each parameter defined for the OCF resource agent can be called in the script itself using the OCF_RESKEY_ prefix. The agent's header contains a reference to this: to access the content of the binary parameter, you just need to reference the OCF_RESKEY_binary variable. A good resource agent will set default values for the specified parameters whenever these settings are meaningful. The second part of the agent is thus similar to Listing 2.

Listing 2: Default Parameter Values

# Default values, kept in separate variables for reuse in the metadata
OCF_RESKEY_user_default="asterisk"
OCF_RESKEY_group_default="asterisk"
OCF_RESKEY_binary_default="asterisk"
OCF_RESKEY_canary_binary_default="astcanary"
OCF_RESKEY_config_default="/etc/asterisk/asterisk.conf"
OCF_RESKEY_additional_parameters_default="-g -vvv"
OCF_RESKEY_realtime_default="false"
OCF_RESKEY_maxfiles_default="8192"

# Apply a default only if the administrator has not set the parameter
: ${OCF_RESKEY_binary=${OCF_RESKEY_binary_default}}
: ${OCF_RESKEY_canary_binary=${OCF_RESKEY_canary_binary_default}}
: ${OCF_RESKEY_config=${OCF_RESKEY_config_default}}
: ${OCF_RESKEY_user=${OCF_RESKEY_user_default}}
: ${OCF_RESKEY_group=${OCF_RESKEY_group_default}}
: ${OCF_RESKEY_additional_parameters=${OCF_RESKEY_additional_parameters_default}}
: ${OCF_RESKEY_realtime=${OCF_RESKEY_realtime_default}}
: ${OCF_RESKEY_maxfiles=${OCF_RESKEY_maxfiles_default}}

The approach of first defining a variable with the name of OCF_RESKEY_name_default, which might seem slightly roundabout, is actually well considered: This technique means that the variable will always contain the specified default value, which can then be re-used in the resource agent's metadata later on. If the variable itself were to adopt a specific value at this point, the default would be lost as soon as the administrator set a value.
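
The effect of this two-step idiom can be tried out in isolation. The following sketch assumes that Pacemaker has exported a value for maxfiles only; the value 4096 is made up for illustration, and the script does not need the OCF shell functions.

```shell
#!/bin/sh
# Simulate the environment Pacemaker would provide: the administrator
# has set maxfiles (example value) but not realtime.
OCF_RESKEY_maxfiles="4096"
unset OCF_RESKEY_realtime

OCF_RESKEY_maxfiles_default="8192"
OCF_RESKEY_realtime_default="false"

# ':' is the shell's no-op command; ${var=default} assigns the default
# only if var is unset, so an administrator-supplied value survives.
: ${OCF_RESKEY_maxfiles=${OCF_RESKEY_maxfiles_default}}
: ${OCF_RESKEY_realtime=${OCF_RESKEY_realtime_default}}

echo "maxfiles=$OCF_RESKEY_maxfiles (default $OCF_RESKEY_maxfiles_default)"
echo "realtime=$OCF_RESKEY_realtime (default $OCF_RESKEY_realtime_default)"
# prints:
#   maxfiles=4096 (default 8192)
#   realtime=false (default false)
```

Both the effective value and the default remain available afterward, which is what the metadata needs.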

Usage Function

The usage function tells you which targets the agent supports. The OCF standard considers this function optional, but it is generally considered good manners to include it. Listing 3 shows a usage function that the Asterisk OCF agent could use. The important thing is that you only state the targets that the agent really supports, and these targets must include at least start, stop, monitor, and meta-data.

Listing 3: Usage Function

usage() {
    cat <<UEND
        usage: $0 (start|stop|validate-all|meta-data|status|monitor)

        $0 manages an Asterisk PBX as an HA resource.

        The 'start' operation starts the Asterisk PBX.
        The 'stop' operation stops the Asterisk PBX.
        The 'validate-all' operation reports whether the parameters are valid
        The 'meta-data' operation reports this RA's meta-data information
        The 'status' operation reports whether the Asterisk PBX is running
        The 'monitor' operation reports whether the Asterisk PBX seems to be working

UEND
}

Resource Agent's Metadata

At this point, things start to become tricky for the first time: The resource agent needs to support the meta-data target – Pacemaker relies on the output from the target to find out which parameters the resource agent supports – and which operations it allows.

Figure 2: The information that crm ra info ocf:heartbeat:pacemaker displays comes from the agent's meta-data function.

At the end of the day, the meta-data command (Figure 2) doesn't do anything vastly different from what the usage function does, that is, output text. However, for meta-data, it is important that the output text complies with an XML-based syntax. The output is introduced by the following lines:

<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="asterisk">
<version>1.0</version>

These lines are followed by a description of the resource agent, which is introduced by the XML <longdesc lang="en"> tag, and a brief description, which is introduced by the <shortdesc lang="en"> tag. Of course, you need to close the tags with </longdesc> or </shortdesc>. <parameters> introduces the section in which a name, a type, and a long and short description are specified for every parameter.

Listing 4 provides an example with the definition of the binary parameter. The name attribute defines the name of the parameter, and required defines whether or not the administrator needs to specify the parameter. In the content element, type defines the type of value. The supported types, besides string, are integer for numbers and boolean for classic yes/no choices. In this Asterisk example, maxfiles is an integer type value, and realtime a boolean type value.

Listing 4: Metadata for the Binary Parameter

<parameter name="binary" unique="0" required="0">
<longdesc lang="en">
Location of the Asterisk PBX server binary
</longdesc>
<shortdesc lang="en">Asterisk PBX server binary</shortdesc>
<content type="string" default="${OCF_RESKEY_binary_default}" />
</parameter>
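
Following the same pattern, the entries for the integer and boolean parameters could be produced as in this sketch. The function name print_typed_parameters and the description texts are illustrative assumptions, not taken from the original agent; only the parameter names, types, and defaults come from the article.

```shell
#!/bin/sh
OCF_RESKEY_realtime_default="false"
OCF_RESKEY_maxfiles_default="8192"

# Hypothetical helper that prints two <parameter> entries. The unquoted
# heredoc delimiter lets the shell expand the *_default variables.
print_typed_parameters() {
    cat <<END
<parameter name="realtime" unique="0" required="0">
<longdesc lang="en">
Whether Asterisk should run in real-time mode
</longdesc>
<shortdesc lang="en">Real-time mode</shortdesc>
<content type="boolean" default="${OCF_RESKEY_realtime_default}" />
</parameter>

<parameter name="maxfiles" unique="0" required="0">
<longdesc lang="en">
Maximum number of files Asterisk is allowed to open
</longdesc>
<shortdesc lang="en">Open file limit</shortdesc>
<content type="integer" default="${OCF_RESKEY_maxfiles_default}" />
</parameter>
END
}

print_typed_parameters
```

Because the defaults are expanded from the variables, the metadata and the script cannot drift apart.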

Once you have a matching entry for each parameter, use the </parameters> tag to close the parameter list. The agent supports the following actions:

<actions>
<action name="start" timeout="20" />
<action name="stop" timeout="20" />
<action name="status" timeout="20" />
<action name="monitor" timeout="30" interval="20" />
<action name="validate-all" timeout="5" />
<action name="meta-data" timeout="5" />
</actions>

These values define the default timeouts after which Pacemaker will report an error, as well as the standard interval for the monitor operation. All of these values can be overridden later in the CRM shell, but it's still a good idea to choose meaningful defaults.

The </resource-agent> entry tells Pacemaker that the agent's metadata output is now complete.

Listing 5 contains a syntactically correct, but incomplete, example of a standards-compliant meta-data function for the Asterisk resource agent. Note that you do need to avoid the use of the "<" and ">" characters in the parameter descriptions; after all, this is XML syntax. The &lt; and &gt; entities provide adequate replacements.

Listing 5: meta-data Function

meta_data() {
cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="asterisk">
<version>1.0</version>

<longdesc lang="en">
Resource agent for the Asterisk PBX.
May manage an Asterisk PBX telephony system or a clone set that
forms an Asterisk distributed device setup.
</longdesc>
<shortdesc lang="en">Manages an Asterisk PBX</shortdesc>

<parameters>

<parameter name="binary" unique="0" required="0">
<longdesc lang="en">
Location of the Asterisk PBX server binary
</longdesc>
<shortdesc lang="en">Asterisk PBX server binary</shortdesc>
<content type="string" default="${OCF_RESKEY_binary_default}" />
</parameter>

[...]

</parameters>

<actions>
<action name="start" timeout="20" />
<action name="stop" timeout="20" />
<action name="status" timeout="20" />
<action name="monitor" timeout="30" interval="20" />
<action name="validate-all" timeout="5" />
<action name="meta-data" timeout="5" />
</actions>
</resource-agent>
END
}
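
If a description text comes from somewhere else, it can be escaped before it goes into the metadata. The following is a minimal sketch; the helper name xml_escape is an assumption and not part of the OCF shell functions.

```shell
#!/bin/sh
# Hypothetical helper: escapes the characters that are special in XML text.
# The ampersand is handled first so that the entities produced by the
# later substitutions are not escaped a second time.
xml_escape() {
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

echo 'Open at most <n> files; must be > 0' | xml_escape
# prints: Open at most &lt;n&gt; files; must be &gt; 0
```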

Validate Function

Each resource agent built to the OCF standard needs a validate function that serves the purpose of discovering whether Asterisk can be launched for the specified configuration. The function needs to check, for example, if the configuration file specified in the parameters actually exists and if the user and group whose privileges Asterisk will use exist on the local system. The resource agent also must ensure that it works itself. If the resource agent relies on binaries whose existence cannot be assumed on some systems, the validate function must check whether the binaries are in place. If necessary, it will issue an error message and quit. Listing 6 contains a validate function that would work for Asterisk – apart from the sipuri section, this function is very generic, so you could modify it for almost any program.

Listing 6: asterisk_validate Function

asterisk_validate() {
    local rc

    # Binaries the agent itself depends on
    check_binary $OCF_RESKEY_binary
    check_binary pgrep

    # sipsak is only needed if a SIP URI is to be monitored
    if [ -n "$OCF_RESKEY_monitor_sipuri" ]; then
        check_binary sipsak
    fi

    if [ ! -f "$OCF_RESKEY_config" ]; then
        ocf_log err "Config $OCF_RESKEY_config doesn't exist"
        return $OCF_ERR_INSTALLED
    fi

    getent passwd "$OCF_RESKEY_user" >/dev/null 2>&1
    rc=$?
    if [ $rc -ne 0 ]; then
        ocf_log err "User $OCF_RESKEY_user doesn't exist"
        return $OCF_ERR_INSTALLED
    fi

    getent group "$OCF_RESKEY_group" >/dev/null 2>&1
    rc=$?
    if [ $rc -ne 0 ]; then
        ocf_log err "Group $OCF_RESKEY_group doesn't exist"
        return $OCF_ERR_INSTALLED
    fi

    true
}
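
The check_binary helper comes from the OCF shell functions. To illustrate the idea without the OCF library, here is a sketch with a hypothetical stand-in named require_binaries; the real helper may behave differently in detail, for example by exiting instead of returning.

```shell
#!/bin/sh
OCF_SUCCESS=0
OCF_ERR_INSTALLED=5

# Hypothetical stand-in for check_binary: verifies that every binary
# named on the command line can be found in the PATH.
require_binaries() {
    for bin in "$@"; do
        if ! command -v "$bin" >/dev/null 2>&1; then
            echo "ERROR: required binary '$bin' not found" >&2
            return $OCF_ERR_INSTALLED
        fi
    done
    return $OCF_SUCCESS
}

require_binaries sh ls
echo "rc for existing binaries: $?"
```

Returning OCF_ERR_INSTALLED tells the cluster that the software needed on this node is missing, as opposed to a generic failure.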

Monitor and Status Operation

One of the very pleasant features of Pacemaker is its ability to check a resource's status; it uses the Monitor operation for checking resource status. When the resource agent is called with the monitor target, it needs to check whether the program will run as defined in the resource configuration. This typically means checking multiple factors. In Asterisk's case, the monitor function is implemented as two functions in the OCF agent. The asterisk_status function checks if a process exists that matches the entry in Asterisk's PID file and whether the astcanary process belonging to it also exists. asterisk_monitor also finds out if the Asterisk process responds to a query via the Asterisk console and checks whether a test request using sipsak works.

Astcanary – A Special Case

Asterisk gives programmers insights into how special cases are handled in status queries. If Asterisk is running in real-time mode, astcanary needs to run in addition to the asterisk process. Every second, astcanary updates the timestamp of a file with the slightly weird name of alt.asterisk.canary.tweet.tweet.tweet, which resides in the Asterisk status folder. Asterisk runs in real-time mode while the timestamp update occurs. If the timestamp update stops, Asterisk reverts to normal mode. If you set the realtime parameter to yes, you need to check the status of astcanary in addition to Asterisk – if Astcanary fails, Asterisk stops running as configured, and the resource agent has to return an error when it checks the status.

Asterisk is also a good example of how to handle programs that don't let you define the location of their status files (PID file, socket for connections by external programs) as a parameter to the binary. Asterisk simply doesn't have a parameter for this situation. The file containing the PID of an active Asterisk instance is always named asterisk.pid and resides in the Asterisk run directory. The Asterisk configuration file tells you where the run directory is; the variable for this is astrundir. It is advisable to first define a variable that contains the value of astrundir for the Asterisk agent. The definition of the variable occurs at the end of the resource agent and might look like:

ASTRUNDIR=`grep astrundir $OCF_RESKEY_config | awk '/^astrundir/ {print $3}'`

It is a good idea to store the value of astlogdir in a separate variable, since it is needed when the program starts. The definition is similar to that of astrundir:

ASTLOGDIR=`grep astlogdir $OCF_RESKEY_config | awk '/^astlogdir/ {print $3}'`
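
Both lookups can be tried against a sample file. The following sketch assumes that asterisk.conf uses lines of the form astrundir => /path; the helper name conf_value and the sample paths are made up for illustration. It folds the grep and awk steps into a single awk call.

```shell
#!/bin/sh
# Build a throwaway sample configuration (paths are examples only)
conf=$(mktemp)
cat > "$conf" <<'EOF'
[directories]
astetcdir => /etc/asterisk
astrundir => /var/run/asterisk
astlogdir => /var/log/asterisk
EOF

# Hypothetical helper: print the value of the first "key => value" line.
# Comparing the whole first field avoids matching commented-out lines.
conf_value() {
    awk -v key="$1" '$1 == key && $2 == "=>" { print $3; exit }' "$2"
}

ASTRUNDIR=$(conf_value astrundir "$conf")
ASTLOGDIR=$(conf_value astlogdir "$conf")
rm -f "$conf"

echo "run directory: $ASTRUNDIR"
echo "log directory: $ASTLOGDIR"
```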

Status Operation

Listing 7 shows a complete example of a status operation. The function checks whether a PID file exists and whether the PID it mentions really does exist on the system. If this is the case, the function uses OCF_RESKEY_realtime to check whether real-time mode is enabled. If so, the status function uses pgrep to check whether an astcanary exists for the Asterisk instance. (The alt.asterisk.canary.tweet.tweet.tweet file always resides in the Asterisk rundir.) If all of these tests return positive results, the function confirms with OCF_SUCCESS. If Asterisk is not running, the function returns OCF_NOT_RUNNING and removes the PID file, if it exists. If Asterisk is running, but astcanary isn't running despite real-time mode, an error is returned with a value of OCF_ERR_GENERIC. This value tells Pacemaker to restart the resource. For an overview of the most important OCF return values, see the "OCF Return Values" box.

OCF Return Values

The return value that an OCF agent delivers when it is called with a specific target defines how Pacemaker evaluates the status of the resource. The most important return codes and their associated values are:

OCF_SUCCESS: The operation completed successfully. The return value is 0. This is the correct return value when a resource has started successfully; it is also used when a monitor operation finds that the resource works correctly.

OCF_NOT_RUNNING: The resource is not running. The return value is 7.

OCF_ERR_GENERIC: A generic error has occurred and Pacemaker will attempt to stop and restart the resource. The return value is 1.

In addition to these three return codes, several others exist; for example, some of them allow for a more precise definition of errors. An exhaustive list is available in Chapter 3 of the OCF Resource Agent Developer's Guide.

Listing 7: asterisk_status Function

asterisk_status() {
    local pid
    local astcanary_pid
    local rc

    if [ ! -f "$ASTRUNDIR/asterisk.pid" ]; then
        ocf_log info "Asterisk PBX is not running"
        return $OCF_NOT_RUNNING
    fi

    # Signal 0 only tests whether the process exists
    pid=`cat "$ASTRUNDIR/asterisk.pid"`
    ocf_run kill -s 0 $pid
    rc=$?

    if [ $rc -eq 0 ]; then
        # In real-time mode, astcanary must be running as well
        if ocf_is_true "$OCF_RESKEY_realtime"; then
            astcanary_pid=`pgrep -d " " -f "astcanary $ASTRUNDIR/alt.asterisk.canary.tweet.tweet.tweet"`
            if [ ! "$astcanary_pid" ]; then
                ocf_log err "Asterisk PBX is running but astcanary is not although it should"
                return $OCF_ERR_GENERIC
            fi
        fi
        return $OCF_SUCCESS
    else
        ocf_log info "Asterisk PBX not running: removing old PID file"
        rm -f "$ASTRUNDIR/asterisk.pid"
        return $OCF_NOT_RUNNING
    fi
}
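
The PID file logic at the core of this listing works for many daemons. The following sketch reduces it to a generic function that runs without the OCF library; the name pidfile_status is hypothetical, and the numeric return codes are those from the "OCF Return Values" box.

```shell
#!/bin/sh
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

# Hypothetical generic status check based on a PID file
pidfile_status() {
    pidfile=$1
    [ -f "$pidfile" ] || return $OCF_NOT_RUNNING

    pid=$(cat "$pidfile")
    # Signal 0 delivers nothing; it only tests whether the PID can be signaled
    if kill -s 0 "$pid" 2>/dev/null; then
        return $OCF_SUCCESS
    fi

    # Stale PID file: the process is gone, so clean up
    rm -f "$pidfile"
    return $OCF_NOT_RUNNING
}

demo_pidfile=$(mktemp)
echo $$ > "$demo_pidfile"     # the current shell is certainly alive
pidfile_status "$demo_pidfile"
echo "status rc: $?"
rm -f "$demo_pidfile"
```

Note that kill -s 0 can also fail for a live process that the caller is not allowed to signal; a resource agent normally runs as root, so the sketch ignores that case.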

Incidentally, the function uses the OCF ocf_run function to call binaries. Compared with a direct call, this approach offers the option of logging all of the output from the command that ocf_run executes. A direct call would send this output to a black hole, and possible error messages would disappear along with it. If you need to execute an external program, ocf_run is very much recommended. This example uses the OCF ocf_is_true function to query the value of realtime by reference to the OCF_RESKEY_realtime environment variable. This approach lets administrators enter either realtime=true or realtime=yes (or the negative counterparts) in the CRM shell. The ocf_is_true function detects all of these values and behaves accordingly. Using an if clause to check for the individual values would result in far more code.
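
To give an impression of what such a truth test involves, here is a sketch of a stand-in named is_true. It is not the real ocf_is_true; the set of accepted spellings is an assumption and may differ from the implementation in the OCF shell functions.

```shell
#!/bin/sh
# Hypothetical stand-in for ocf_is_true; the accepted spellings are an
# assumption for illustration only.
is_true() {
    case "$1" in
        yes|YES|true|TRUE|True|1|on|ON) return 0 ;;
        *) return 1 ;;
    esac
}

for value in true yes 1 no false ""; do
    if is_true "$value"; then
        echo "'$value' enables real-time mode"
    else
        echo "'$value' leaves real-time mode off"
    fi
done
```

Anything that is not recognized counts as false, so a typo in the CRM configuration leaves the option off rather than enabling it by accident.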

Monitor Operation

Unlike the Status operation, the Monitor operation doesn't check whether processes exist, but whether Asterisk is responding to external requests. For other programs, you could use telnet or openssl s_client, for example, to test whether a response is returned for a specific port. The Asterisk binary has a client mode, which you can start with the -r option; adding -x lets you pass a single console command, as the helper function in Listing 9 does.

A Monitor operation for Asterisk might look like Listing 8. It is no coincidence that the asterisk_monitor function initially calls asterisk_status and quits with an error message if the return value is not OCF_SUCCESS. Based on this principle, the agent later only needs to call the asterisk_monitor function when called with the monitor target. The asterisk_rx function is a helper function defined at the start of the agent (Listing 9).

Listing 8: Monitor Function

asterisk_monitor() {
    local rc

    asterisk_status
    rc=$?

    # If status returned an error, return that immediately
    if [ $rc -ne $OCF_SUCCESS ]; then
        return $rc
    fi

    # Check whether connecting to asterisk is possible
    asterisk_rx 'core show channels count'
    rc=$?

    if [ $rc -ne 0 ]; then
        ocf_log err "Failed to connect to the Asterisk PBX"
        return $OCF_ERR_GENERIC
    fi

    # Optionally check the monitor URI with sipsak
    # The return values:
    # 0 means that a 200 was received.
    # 1 means something else than 1xx or 2xx was received.
    # 2 will be returned on local errors like non resolvable names
    #   or wrong options combination.
    # 3 will be returned on remote errors like socket errors
    #   (e.g. icmp error), redirects without a contact header or
    #   simply no answer (timeout).
    #   This can also happen if sipsak is run too early after asterisk
    #   start.
    if [ -n "$OCF_RESKEY_monitor_sipuri" ]; then
        ocf_run sipsak -s "$OCF_RESKEY_monitor_sipuri"
        rc=$?
        case "$rc" in
            1|2) return $OCF_ERR_GENERIC;;
            3)   return $OCF_NOT_RUNNING;;
        esac
    fi

    ocf_log debug "Asterisk PBX monitor succeeded"
    return $OCF_SUCCESS
}

Listing 9: asterisk_rx Helper Function

asterisk_rx() {
    # if $HOME is set, asterisk -rx writes a .asterisk_history there
    (
        unset HOME
        ocf_run $OCF_RESKEY_binary -r -s $ASTRUNDIR/asterisk.ctl -x "$1"
    )
}
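
The parentheses in Listing 9 start a subshell, so HOME is only unset for the Asterisk call and not for the rest of the agent. The following sketch shows the effect on its own; the HOME value is an example.

```shell
#!/bin/sh
HOME=/tmp/example-home      # example value only
export HOME

(
    # Changes made inside the parentheses affect the subshell only
    unset HOME
    echo "inside subshell:  HOME is ${HOME-unset}"
)
echo "outside subshell: HOME is $HOME"
# prints:
#   inside subshell:  HOME is unset
#   outside subshell: HOME is /tmp/example-home
```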

The Monitor function in Listing 8 first checks if Asterisk is running (by calling asterisk_status()) and then tells the Asterisk client to issue the core show channels count command at the Asterisk console. If the administrator has entered a SIP URI as the monitor_sipuri parameter, the agent attempts to open a SIP connection to this URI using sipsak. Depending on the success of the individual commands, the function returns OCF_SUCCESS (0) or an error code.

Start Function

The start function is primarily designed to make sure the Asterisk daemon runs as intended by the CRM configuration. However, it has to do more than this. The OCF standard requires the start function to return OCF_SUCCESS, that is 0, in cases where Asterisk is already running. It wouldn't make any sense to start the daemon again in this case, because it would very likely quit with an error. asterisk_monitor provides a powerful tool for testing whether Asterisk is working – start would ideally call asterisk_monitor and interpret the return value accordingly before the function does anything else.

If Asterisk is not running, the agent needs to start the service. Astcanary again plays an important role: It is launched by Asterisk itself so that the Asterisk agent doesn't need to worry about it. But if the Asterisk instance dies, the matching Astcanary will not automatically follow suit. This explains why the start function also has to remove any Astcanary zombies before starting a new Asterisk.

Because agents that comply with the OCF standard must not be distribution-specific, tools such as the Debian start-stop-daemon are not permissible. The agent has to call the Asterisk binary directly. To do so, it constructs an Asterisk call from the various defined parameters and then uses ocf_run to execute it. Before doing so, the agent ensures that the folders Asterisk needs to work actually exist and that the Asterisk user defined in the parameters has write privileges for these folders. Immediately after calling the binary, the start function runs monitor again to check if the start really worked and to ensure that Asterisk is running as intended. Apart from the Astcanary complications, the start function is typical, and you can use it with many other daemons with minor modifications. The entire start function looks like Listing 10.

Listing 10: Start Function

asterisk_start() {
    local asterisk_extra_params
    local astcanary_pid
    local dir
    local i
    local rc

    asterisk_status
    rc=$?
    if [ $rc -eq $OCF_SUCCESS ]; then
        ocf_log info "Asterisk PBX already running"
        return $OCF_SUCCESS
    fi

    # If Asterisk is not already running, make sure there is no
    # old astcanary instance when the new asterisk starts. To
    # achieve this, kill old astcanary instances belonging to
    # this $ASTRUNDIR.

    # Find out PIDs of running astcanaries
    astcanary_pid=`pgrep -d " " -f "astcanary $ASTRUNDIR/alt.asterisk.canary.tweet.tweet.tweet"`

    # If there are astcanaries running that belong to $ASTRUNDIR,
    # kill them one by one.
    if [ "$astcanary_pid" ]; then
        for i in $astcanary_pid; do ocf_run kill -s KILL $i; done
    fi

    for dir in $ASTRUNDIR $ASTLOGDIR $ASTLOGDIR/cdr-csv $ASTLOGDIR/cdr-custom; do
        if [ ! -d "$dir" ]; then
            ocf_run install -d -o $OCF_RESKEY_user -g $OCF_RESKEY_group \
                $dir || exit $OCF_ERR_GENERIC
        fi
        # Regardless of whether we just created the directory or it
        # already existed, check whether it is writable by the
        # configured user
        if ! su -s /bin/sh - $OCF_RESKEY_user -c "test -w $dir"; then
            ocf_log err "Directory $dir is not writable by $OCF_RESKEY_user"
            exit $OCF_ERR_PERM
        fi
    done

    # set MAXFILES
    ulimit -n $OCF_RESKEY_maxfiles

    # Determine whether Asterisk PBX is supposed to run in Realtime
    # mode or not and make asterisk daemonize automatically
    if ocf_is_true "$OCF_RESKEY_realtime"; then
        asterisk_extra_params="-F -p"
    else
        asterisk_extra_params="-F"
    fi

    ocf_run ${OCF_RESKEY_binary} -G $OCF_RESKEY_group \
                -U $OCF_RESKEY_user -C $OCF_RESKEY_config \
                $OCF_RESKEY_additional_parameters \
                $asterisk_extra_params
    rc=$?
    if [ $rc -ne 0 ]; then
        ocf_log err "Asterisk PBX start command failed: $rc"
        exit $OCF_ERR_GENERIC
    fi

    # Spin waiting for the server to come up.
    # Let the CRM/LRM time us out if required
    while true; do
        asterisk_monitor
        rc=$?
        [ $rc -eq $OCF_SUCCESS ] && break
        if [ $rc -ne $OCF_NOT_RUNNING ]; then
            ocf_log err "Asterisk PBX start failed"
            exit $OCF_ERR_GENERIC
        fi
        sleep 2
    done

    ocf_log info "Asterisk PBX started"
    return $OCF_SUCCESS
}
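
Stripped of the Asterisk specifics, the start logic boils down to three steps: return success if the service already runs, launch it, and poll until it reports success. The following sketch shows this skeleton with stubbed-out functions; all demo_ names and the STATE variable are hypothetical stand-ins for a real daemon. Unlike Listing 10, which leaves the timeout to the cluster, the sketch gives up after a fixed number of tries so that it always terminates when run by hand.

```shell
#!/bin/sh
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

STATE=stopped    # stand-in for a real daemon's state

demo_status() {
    if [ "$STATE" = running ]; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

demo_launch() {
    STATE=running    # a real agent would call the daemon's binary here
}

demo_start() {
    # Starting an already running resource must succeed
    if demo_status; then
        return $OCF_SUCCESS
    fi

    demo_launch || return $OCF_ERR_GENERIC

    # Poll until the service reports success, with a bounded number of tries
    tries=0
    while [ "$tries" -lt 5 ]; do
        if demo_status; then
            return $OCF_SUCCESS
        fi
        tries=$((tries + 1))
        sleep 1
    done
    return $OCF_ERR_GENERIC
}

demo_start; echo "first start rc: $?"
demo_start; echo "second start rc: $?"
```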

The Stop Function

The whole point of a stop function is to make sure that Asterisk definitely isn't running after the call to the agent. At the start of the function is a call to asterisk_status; if it reports that Asterisk is already switched off, the stop function returns OCF_SUCCESS right away. Following this, the function uses a four-part approach: The quit command is passed to Asterisk via the Asterisk console in the form of the core stop now command. This should do the job under normal circumstances. If not, the agent sends the process a SIGTERM signal.

If the kill command doesn't return 0, it could mean that kill was unable to perform its task and that the daemon never received the signal. The agent then exits at this point, issuing an OCF_ERR_GENERIC, because it is very likely that the system is experiencing some major problems. If kill -s TERM doesn't return an error, this doesn't mean the process has died. The agent therefore keeps checking the status until shortly before the stop timeout is reached and issues a SIGKILL if the process is still running. If the daemon is still running at this point, there is no alternative to manual intervention by the administrator. The asterisk_stop function would look very much like Listing 11.
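
How long the agent waits between SIGTERM and SIGKILL can be derived from the operation timeout. This sketch assumes, as Listing 11 does, that the timeout arrives in milliseconds; the function name shutdown_timeout_for is hypothetical, while the 15-second fallback and the 5-second margin are the values used in the listing.

```shell
#!/bin/sh
# Hypothetical helper: seconds to wait after SIGTERM before sending SIGKILL.
# $1 is the operation timeout in milliseconds and may be empty.
shutdown_timeout_for() {
    if [ -n "$1" ]; then
        # Convert to seconds and keep a 5-second margin for the SIGKILL step
        echo $(( ($1 / 1000) - 5 ))
    else
        echo 15
    fi
}

echo "stop timeout 20s -> wait $(shutdown_timeout_for 20000)s"
echo "stop timeout 60s -> wait $(shutdown_timeout_for 60000)s"
echo "no timeout given -> wait $(shutdown_timeout_for "")s"
# prints:
#   stop timeout 20s -> wait 15s
#   stop timeout 60s -> wait 55s
#   no timeout given -> wait 15s
```

The margin leaves the agent enough time to send the final signal and return before the cluster declares the stop operation failed.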

Listing 11: asterisk_stop Function

asterisk_stop() {
    local pid
    local astcanary_pid
    local rc
    local shutdown_timeout
    local count
    local i

    asterisk_status
    rc=$?
    if [ $rc -eq $OCF_NOT_RUNNING ]; then
        ocf_log info "Asterisk PBX already stopped"
        return $OCF_SUCCESS
    fi

    # do a "soft shutdown" via the asterisk command line first
    asterisk_rx 'core stop now'

    asterisk_status
    rc=$?
    if [ $rc -eq $OCF_NOT_RUNNING ]; then
        ocf_log info "Asterisk PBX stopped"
        return $OCF_SUCCESS
    fi

    # If "core stop now" didn't succeed, try SIGTERM
    pid=`cat $ASTRUNDIR/asterisk.pid`
    ocf_run kill -s TERM $pid
    rc=$?
    if [ $rc -ne 0 ]; then
        ocf_log err "Asterisk PBX couldn't be stopped"
        exit $OCF_ERR_GENERIC
    fi

    # Wait for the process to stop. The operation timeout arrives in
    # milliseconds; leave 5 seconds of headroom for the SIGKILL step.
    shutdown_timeout=15
    if [ -n "$OCF_RESKEY_CRM_meta_timeout" ]; then
        shutdown_timeout=$((($OCF_RESKEY_CRM_meta_timeout/1000)-5))
    fi
    count=0
    while [ $count -lt $shutdown_timeout ]; do
        asterisk_status
        rc=$?
        if [ $rc -eq $OCF_NOT_RUNNING ]; then
            break
        fi
        count=`expr $count + 1`
        sleep 1
        ocf_log debug "Asterisk PBX still hasn't stopped yet. Waiting ..."
    done

    asterisk_status
    rc=$?
    if [ $rc -ne $OCF_NOT_RUNNING ]; then
        # SIGTERM didn't help either, try SIGKILL
        ocf_log info "Asterisk PBX failed to stop after ${shutdown_timeout}s using SIGTERM. Trying SIGKILL ..."
        ocf_run kill -s KILL $pid
    fi

    # After killing asterisk, stop astcanary
    if ocf_is_true "$OCF_RESKEY_realtime"; then
      astcanary_pid=`pgrep -d " " -f "astcanary $ASTRUNDIR/alt.asterisk.canary.tweet.tweet.tweet"`
      if [ "$astcanary_pid" ]; then
        # Kill each astcanary process individually
        for i in $astcanary_pid; do ocf_run kill -s KILL $i; done
      fi
    fi

    ocf_log info "Asterisk PBX stopped"
    return $OCF_SUCCESS
}
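
The grace period between SIGTERM and SIGKILL is the part of the stop function that is easiest to get wrong, so it can help to look at the arithmetic in isolation. The following is only a sketch: stop_grace_period is a hypothetical helper that does not exist in the agent, and the lower bound of one second is an assumption added here (the listing itself would skip the wait loop entirely for timeouts below five seconds).

```shell
#!/bin/sh
# Hypothetical helper, not part of the published agent.
# Prints the number of seconds to wait after SIGTERM before SIGKILL.
# $1: operation timeout in milliseconds, as Pacemaker exports it in
#     OCF_RESKEY_CRM_meta_timeout (may be empty).
stop_grace_period() {
    local timeout_ms
    local grace_s
    timeout_ms="$1"

    # No timeout known: fall back to the same default as the listing
    if [ -z "$timeout_ms" ]; then
        echo 15
        return 0
    fi

    # Convert to seconds and keep 5 seconds of headroom for SIGKILL
    grace_s=$(( $timeout_ms / 1000 - 5 ))

    # Assumption: never wait less than one second
    if [ "$grace_s" -lt 1 ]; then
        grace_s=1
    fi
    echo "$grace_s"
}

stop_grace_period 20000   # 20-second stop timeout
stop_grace_period 3000    # timeout shorter than the headroom
stop_grace_period ""      # no timeout passed in
```

With a stop timeout of 20 seconds, the sketch waits 15 seconds before escalating, which leaves the agent enough time to send the SIGKILL and return before the cluster manager declares the stop operation failed.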

The Resource Agent Body

Thus far, I have defined functions that handle the agent's various tasks. What is still missing is the part that actually calls these functions when the agent is invoked with a specific action.

This part of the program also needs to contain the definitions for ASTRUNDIR and ASTLOGDIR that I mentioned earlier. These definitions must come after the call to the validate function, because both values are read from the Asterisk configuration file, and only a successful validation guarantees that the file is readable. The remainder mainly reflects the syntax used by many init scripts. Listing 12 contains a full-fledged example, which would go at the end of the agent file.

Listing 12: Agent Body

case "$1" in
  meta-data)    meta_data
                exit $OCF_SUCCESS;;
  usage|help)   usage
                exit $OCF_SUCCESS;;
esac


# Anything except meta-data and help must pass validation
asterisk_validate || exit $?

# Now that validate has passed and we can be sure to be able to read
# the config file, set convenience variables
ASTRUNDIR=$(awk '/^astrundir/ {print $3}' "$OCF_RESKEY_config")
ASTLOGDIR=$(awk '/^astlogdir/ {print $3}' "$OCF_RESKEY_config")

# What kind of method was invoked?
case "$1" in
  start)        asterisk_start;;
  stop)         asterisk_stop;;
  status)       asterisk_status;;
  monitor)      asterisk_monitor;;
  validate-all) ;;
  *)            usage
                exit $OCF_ERR_UNIMPLEMENTED;;
esac
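
To see how the two convenience variables get their values, you can try the extraction against a small sample file. This is a minimal sketch under stated assumptions: the sample content mimics the key => value layout of an asterisk.conf file but is invented for this example, and ast_conf_value is a hypothetical helper that matches the key exactly instead of using the prefix pattern from the listing.

```shell
#!/bin/sh
# Sketch only: the sample file is an assumption about what a typical
# asterisk.conf contains; it is not taken from a real system.
conf=$(mktemp)
cat > "$conf" <<'EOF'
[directories]
astetcdir => /etc/asterisk
astrundir => /var/run/asterisk
astlogdir => /var/log/asterisk
EOF

# Hypothetical helper: print the value assigned to key $1 in file $2.
# Each line splits into "key", "=>", "value", so the value is field 3.
ast_conf_value() {
    awk -v key="$1" '$1 == key { print $3 }' "$2"
}

ASTRUNDIR=$(ast_conf_value astrundir "$conf")
ASTLOGDIR=$(ast_conf_value astlogdir "$conf")
echo "run dir: $ASTRUNDIR"
echo "log dir: $ASTLOGDIR"

rm -f "$conf"
```

If the configuration file were missing or unreadable, both variables would silently end up empty, which is why the agent body runs the validation first.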

Testing the Resource Agent

It isn't absolutely necessary to install Pacemaker to test whether the agent is working. Because OCF agents are shell scripts, you can call them at the command line (Figure 3). The important thing is that the OCF_ROOT environment variable is set; on most Linux distributions, the variable needs to point to /usr/lib/ocf.

Figure 3: You can test the complete resource agent at the command line without the need to install Pacemaker.

Additionally, the shell needs an OCF_RESKEY_ environment variable for each agent parameter, for example, OCF_RESKEY_realtime. This is especially important if a parameter is tagged as required or if no default value exists for it. Then, call the agent, which is named asterisk in this example, by typing ./asterisk monitor to check the status of the resource. (The agent resides in /usr/lib/ocf/resource.d/heartbeat.) The start, stop, and all other actions work in the same way. For meaningful results, you need a working installation of Asterisk or whatever program the agent was written for. If the agent does not work as intended, calling it with a prefix of sh -x traces every command and will probably provide more insight.
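
A manual test session could be wrapped in a few lines of shell like the following. Treat it as a sketch, not a recipe: run_action is a hypothetical helper, the value chosen for OCF_RESKEY_realtime is arbitrary, and the paths assume the default locations mentioned above.

```shell
#!/bin/sh
# Assumed default locations; adjust them for your distribution.
export OCF_ROOT=/usr/lib/ocf
export OCF_RESKEY_realtime=false     # arbitrary example value
AGENT="$OCF_ROOT/resource.d/heartbeat/asterisk"

# Hypothetical helper: run one agent action and report its exit code.
# For monitor, 0 means "running" and 7 means "not running".
run_action() {
    local rc
    if [ ! -x "$AGENT" ]; then
        echo "agent not found or not executable: $AGENT" >&2
        return 5    # same value as OCF_ERR_INSTALLED
    fi
    "$AGENT" "$1"
    rc=$?
    echo "$1 returned $rc"
    return $rc
}

run_action monitor || true
```

Replacing the direct call with sh -x "$AGENT" "$1" inside the helper would produce the command trace described above.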

If you have a working Pacemaker, you can also define a resource with the agent (in this example, the agent is named ocf:heartbeat:asterisk) and then test it with ocf-tester (Figure 4). Assuming the Asterisk resource is named p_asterisk, the full command for the call to ocf-tester would be:

ocf-tester -n p_asterisk /usr/lib/ocf/resource.d/heartbeat/asterisk

Figure 4: ocf-tester checks an OCF agent for compliance with the OCF standard.

If ocf-tester outputs an error, you know that something is wrong with your work.

Logging with OCF

The code snippets for the individual functions contain various calls to ocf_log with various parameters that define the relevance of the output.

The severity, which is the first argument of an ocf_log call, determines how the log entries are tagged in the high-availability log so that the administrator can search for them in a targeted way. The following severity levels are supported:

debug: Debug information, which is of very little relevance to normal cluster operations. By default, the high-availability stack suppresses these messages on most systems.

info: Generic logging information that doesn't require any direct intervention by the administrator.

warn: Warnings; that is, unexpected events that are not classified as errors.

err: Error messages, after which the OCF agent should exit with the corresponding return value.

crit: Critical errors that require immediate administrative intervention.
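
The following sketch shows how two of these severity levels might be used together. So that it runs without the OCF shell library, it defines its own stand-ins for ocf_log and the return codes; in a real agent, these come from ocf-shellfuncs. The function check_config_readable is a hypothetical example and is not part of the Asterisk agent.

```shell
#!/bin/sh
# Stand-ins (assumptions) so the sketch runs on any system.
# A real agent gets ocf_log and the OCF_* codes from ocf-shellfuncs.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
ocf_log() {
    local severity
    severity="$1"
    shift
    echo "$severity: $*" >&2
}

# Hypothetical check: log at "err" and fail if the file is unreadable,
# log at "debug" if everything is fine.
check_config_readable() {
    if [ ! -r "$1" ]; then
        ocf_log err "Config file $1 is missing or not readable"
        return $OCF_ERR_GENERIC
    fi
    ocf_log debug "Config file $1 is readable"
    return $OCF_SUCCESS
}

check_config_readable /nonexistent/asterisk.conf || echo "check failed with $?"
```

The helper only returns the error code; the caller then decides whether to exit, which keeps the err message and the matching exit value close together.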

Conclusions

The OCF standard is a very powerful tool for creating high-quality resource agents for Pacemaker. Because the standard is based on shell scripting, administrators with some scripting skills should find it fairly easy to build an agent.

This article used Asterisk as an example application; however, most of the standard functions are similar for any agent. If you have created a resource agent and would like to make it part of the official cluster-agents package, you can post your code on the Linux-HA Dev mailing list [4] as a basis for discussion. Look online for the complete Asterisk resource agent implementation [5].