Monitoring your cluster with a home-grown OCF agent
Personal Agent
The Linux cluster stack includes multiple components – Corosync or Heartbeat handles cluster communication, Pacemaker takes care of the cluster services (known as "resources"), and the storage component is most often DRBD. The resource layer, that is, Pacemaker, contains multiple components that mesh to control the cluster in the best possible way. The unspectacular agent layer in the cluster stack is in fact extremely important for the functionality of the entire cluster.
The cluster can't perform its task perfectly if the resource agents are not working well, and this means that high-quality resource agents are very important. If you don't have a good agent for a specific program you are running on the cluster, it could be worth your while to program your own resource agent.
This article describes how to create a resource agent that is compatible with the Open Cluster Framework (OCF) standards. The stated goal of OCF is to "define standards for clustering APIs." Creating an agent that complies with the OCF standards will maximize the portability of your agent and minimize the late-night troubleshooting for problems on your cluster (see the box titled "More on Agents").
Managing your cluster applications through non-standards-compliant scripts, which is theoretically possible using an LSB class in Pacemaker, can quickly cause problems on the cluster. (If you've ever had to repair a Samba cluster in the middle of the night because the Samba init script provided by Debian doesn't support the monitor target correctly, you will know exactly what I mean.) Maintaining OCF compliance can save you a huge amount of trouble.
Preparations
Before you start writing the first line of code [1], it is a good idea to sort out a few details. You should always have the OCF standard on hand; it is available on the web [2]. You will also find it useful to have the current version 1.0.2 [3] of the OCF Resource Agent Developer's Guide, created by Hastexo's Florian Haas. The Developer's Guide gives you an overview of the details that will require your attention. Resource agent authors should also be aware of the parameters that their agent will need to support.
In this article, I will show how to create an OCF resource agent for the Asterisk telephony solution. I was part of a team of developers that created an Asterisk agent just a few weeks ago; the agent applies the approaches described in this article.
The default files in Debian, which typically reside in /etc/default/ and are named after the program for which they provide default settings, give you a good idea of the options that are worth adding to the resource agent (Figure 1).
In Asterisk's case, this information includes a specification of the user account and the privileges necessary for running Asterisk. You will also find a ulimit value that defines the number of files that Asterisk is allowed to open at any given time. Debian references these settings in the Asterisk init script, and they can be implemented as parameters in an OCF agent.
Another good reference is the --help command-line option. The list of parameters that the binary itself supports gives the OCF agent author an overview of the supported configuration values.
In this example, the resource agent for Asterisk should support the following parameters: binary specifies where the Asterisk binary resides. canary_binary defines the location of the astcanary binary, and config shows the location of asterisk.conf. additional_parameters lets you manually define more parameters. realtime defines whether or not Asterisk will run in realtime mode. And, maxfiles sets the maximum number of files that the telephone system is allowed to open at any given time.
Resource Agent Headers
The first line in the OCF resource agent is the same as for any other shell script – the shebang. This line is followed by a generic description for the resource agent and a copyright notice; a list of parameters supported by the resource agent should follow to provide an at-a-glance overview. The header doesn't follow any specific syntax, so the header for the Asterisk agent could look similar to Listing 1.
Listing 1: Resource Agent Header
#!/bin/sh
#
#
# Asterisk
#
# Description:  Manages an Asterisk PBX as an HA resource
#
# Authors:      Martin Gerhard Loschwitz
#               Florian Haas
#
# Support:      linux-ha@lists.linux-ha.org
# License:      GNU General Public License (GPL)
#
# (c) 2011      hastexo Professional Services GmbH
#
# This resource agent is loosely derived from the MySQL resource
# agent, which itself is made available to the public under the
# following copyright:
#
# (c) 2002-2005 International Business Machines, Inc.
#     2005-2010 Linux-HA contributors
#
# See usage() function below for more details ...
#
# OCF instance parameters:
#    OCF_RESKEY_binary
#    OCF_RESKEY_canary_binary
#    OCF_RESKEY_config
#    OCF_RESKEY_user
#    OCF_RESKEY_group
#    OCF_RESKEY_additional_parameters
#    OCF_RESKEY_realtime
#    OCF_RESKEY_maxfiles
#    OCF_RESKEY_monitor_sipuri
###############################################################
The next step is to initialize all of the OCF-specific functions:
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs
The OCF_ROOT variable, which is referenced at this point, has already been set in the Pacemaker environment the script will run on. From now on, the agent has access to a full set of OCF functions.
Default Values for Configuration Parameters
Each parameter defined for the OCF resource agent can be accessed in the script itself using the OCF_RESKEY_ prefix. The agent's header already hints at this: To read the content of the binary parameter, you just need to reference the OCF_RESKEY_binary variable. A good resource agent will set default values for the specified parameters, whenever these settings are meaningful. The second part of the agent is thus similar to Listing 2.
Listing 2: Default Parameter Values
OCF_RESKEY_user_default="asterisk"
OCF_RESKEY_group_default="asterisk"
OCF_RESKEY_binary_default="asterisk"
OCF_RESKEY_canary_binary_default="astcanary"
OCF_RESKEY_config_default="/etc/asterisk/asterisk.conf"
OCF_RESKEY_additional_parameters_default="-g -vvv"
OCF_RESKEY_realtime_default="false"
OCF_RESKEY_maxfiles_default="8192"

: ${OCF_RESKEY_binary=${OCF_RESKEY_binary_default}}
: ${OCF_RESKEY_canary_binary=${OCF_RESKEY_canary_binary_default}}
: ${OCF_RESKEY_config=${OCF_RESKEY_config_default}}
: ${OCF_RESKEY_user=${OCF_RESKEY_user_default}}
: ${OCF_RESKEY_group=${OCF_RESKEY_group_default}}
: ${OCF_RESKEY_additional_parameters=${OCF_RESKEY_additional_parameters_default}}
: ${OCF_RESKEY_realtime=${OCF_RESKEY_realtime_default}}
: ${OCF_RESKEY_maxfiles=${OCF_RESKEY_maxfiles_default}}
The approach of first defining a variable with the name of OCF_RESKEY_name_default, which might seem slightly roundabout, is actually well considered: This technique means that the _default variable will always contain the specified default value, which can then be re-used in the resource agent's metadata later on. If the default were assigned directly to the parameter variable instead, it would be lost as soon as the administrator set a value.
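The effect of the two-step assignment can be tried out in isolation. The following is a minimal sketch, not part of the agent: demo_defaults is a hypothetical helper that reuses the maxfiles parameter from Listing 2 and prints the effective value next to the preserved default.

```shell
#!/bin/sh
# demo_defaults [VALUE]: hypothetical helper, not part of the agent.
# Prints the effective parameter value next to the preserved default.
demo_defaults() {
    (
        # Subshell: nothing leaks into the calling environment
        unset OCF_RESKEY_maxfiles
        OCF_RESKEY_maxfiles_default="8192"

        # Simulate an administrator-supplied value if one was passed in
        if [ $# -gt 0 ]; then
            OCF_RESKEY_maxfiles="$1"
        fi

        # Assign the default only when the parameter is unset
        : ${OCF_RESKEY_maxfiles=${OCF_RESKEY_maxfiles_default}}

        echo "maxfiles=$OCF_RESKEY_maxfiles default=$OCF_RESKEY_maxfiles_default"
    )
}

demo_defaults          # prints: maxfiles=8192 default=8192
demo_defaults 16384    # prints: maxfiles=16384 default=8192
```

Note that the ${name=value} form without a colon only assigns when the variable is unset; a parameter that was explicitly set to an empty string stays empty.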
Usage Function
The usage function tells you which targets the agent supports. The OCF standard considers this function optional, but it is generally considered good manners to include it. Listing 3 shows a usage function that the Asterisk OCF agent could use. The important thing is that you only state the targets that the agent really supports, and these targets must include at least start, stop, monitor, and meta-data.
Listing 3: Usage Function
usage() {
    cat <<UEND
    usage: $0 (start|stop|validate-all|meta-data|status|monitor)

    $0 manages an Asterisk PBX as an HA resource.

    The 'start' operation starts the Asterisk PBX.
    The 'stop' operation stops the Asterisk PBX.
    The 'validate-all' operation reports whether the parameters are valid
    The 'meta-data' operation reports this RA's meta-data information
    The 'status' operation reports whether the Asterisk PBX is running
    The 'monitor' operation reports whether the Asterisk PBX seems to be working

UEND
}
Resource Agent's Metadata
At this point, things start to become tricky for the first time: The resource agent needs to support the meta-data target. Pacemaker relies on the output from this target to find out which parameters the resource agent supports and which operations it allows.
At the end of the day, the meta-data command (Figure 2) doesn't do anything vastly different from what the usage function does, that is, output text. However, for meta-data, it is important that the output text complies with an XML-based syntax. The output is introduced by the following lines:
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="asterisk">
<version>1.0</version>
These lines are followed by a description of the resource agent, which is introduced by the XML <longdesc lang="en"> tag, and a brief description, which is introduced by the <shortdesc lang="en"> tag. Of course, you need to close the tags with </longdesc> or </shortdesc>. <parameters> introduces the section in which a name, a type, and a long and short description are specified for every parameter.
Listing 4 provides an example with the definition of the binary parameter. name defines the name of the parameter, and required defines whether or not the administrator needs to specify the parameter. In the content section, type defines the type of value. The supported types, besides string, are integer for numbers and boolean for classic yes/no choices. In this Asterisk example, maxfiles is an integer type value, and realtime a boolean type value.
Listing 4: Metadata for the Binary Parameter
<parameter name="binary" unique="0" required="0">
<longdesc lang="en">
Location of the Asterisk PBX server binary
</longdesc>
<shortdesc lang="en">Asterisk PBX server binary</shortdesc>
<content type="string" default="${OCF_RESKEY_binary_default}" />
</parameter>
Once you have a matching entry for each parameter, use the </parameters> tag to close the parameter list. The agent supports the following actions:
<actions>
<action name="start" timeout="20" />
<action name="stop" timeout="20" />
<action name="status" timeout="20" />
<action name="monitor" timeout="30" interval="20" />
<action name="validate-all" timeout="5" />
<action name="meta-data" timeout="5" />
</actions>
These values define the default timeouts, after which Pacemaker will report an error, and the standard interval for the monitor operation. All of these values can be overridden later in the CRM shell, but it's still a good idea to choose meaningful defaults.
The </resource-agent> entry tells Pacemaker that the agent's metadata output is now complete.
Listing 5 contains a syntactically correct, but incomplete, example of a standards-compliant meta-data function for the Asterisk resource agent. Note that you do need to avoid the use of the "<" and ">" characters in the parameter descriptions; after all, this is XML syntax. The &lt; and &gt; entities provide adequate replacements.
Listing 5: meta-data Function
meta_data() {
    cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="asterisk">
<version>1.0</version>

<longdesc lang="en">
Resource agent for the Asterisk PBX.
May manage an Asterisk PBX telephony system or a clone set that
forms an Asterisk distributed device setup.
</longdesc>
<shortdesc lang="en">Manages an Asterisk PBX</shortdesc>

<parameters>

<parameter name="binary" unique="0" required="0">
<longdesc lang="en">
Location of the Asterisk PBX server binary
</longdesc>
<shortdesc lang="en">Asterisk PBX server binary</shortdesc>
<content type="string" default="${OCF_RESKEY_binary_default}" />
</parameter>

[...]

</parameters>

<actions>
<action name="start" timeout="20" />
<action name="stop" timeout="20" />
<action name="status" timeout="20" />
<action name="monitor" timeout="30" interval="20" />
<action name="validate-all" timeout="5" />
<action name="meta-data" timeout="5" />
</actions>
</resource-agent>
END
}
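The escaping rule for parameter descriptions can also be applied mechanically. The following sketch assumes a hypothetical helper named xml_escape, which is not part of the Asterisk agent or of ocf-shellfuncs; it replaces the three characters that most often break metadata output.

```shell
#!/bin/sh
# xml_escape TEXT: hypothetical helper, not part of the Asterisk agent.
# Prints TEXT with &, < and > replaced by XML entities.
xml_escape() {
    # Replace & first; otherwise the & in the freshly inserted
    # entities would be escaped a second time.
    printf '%s\n' "$1" | sed -e 's/&/\&amp;/g' \
                             -e 's/</\&lt;/g' \
                             -e 's/>/\&gt;/g'
}

xml_escape 'Maximum number of open files (must be > 0 & < 65536)'
```

Because the heredoc in meta_data expands command substitutions, a description could be written as a call to such a helper inside the heredoc, although typing the entities by hand is just as valid for a handful of parameters.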
Validate Function
Each resource agent built to the OCF standard needs a validate function that serves the purpose of discovering whether Asterisk can be launched for the specified configuration. The function needs to check, for example, if the configuration file specified in the parameters actually exists and if the user and group whose privileges Asterisk will use exist on the local system. The resource agent also must ensure that it works itself. If the resource agent relies on binaries whose existence cannot be assumed on some systems, the validate function must check whether the binaries are in place. If necessary, it will issue an error message and quit. Listing 6 contains a validate function that would work for Asterisk. Apart from the monitor_sipuri check, this function is very generic, so you could modify it for almost any program.
Listing 6: asterisk_validate Function
asterisk_validate() {
    local rc

    check_binary $OCF_RESKEY_binary
    check_binary pgrep

    if [ -n "$OCF_RESKEY_monitor_sipuri" ]; then
        check_binary sipsak
    fi

    if [ ! -f $OCF_RESKEY_config ]; then
        ocf_log err "Config $OCF_RESKEY_config doesn't exist"
        return $OCF_ERR_INSTALLED
    fi

    getent passwd $OCF_RESKEY_user >/dev/null 2>&1
    rc=$?
    if [ $rc -ne 0 ]; then
        ocf_log err "User $OCF_RESKEY_user doesn't exist"
        return $OCF_ERR_INSTALLED
    fi

    getent group $OCF_RESKEY_group >/dev/null 2>&1
    rc=$?
    if [ $rc -ne 0 ]; then
        ocf_log err "Group $OCF_RESKEY_group doesn't exist"
        return $OCF_ERR_INSTALLED
    fi

    true
}
Monitor and Status Operation
One of the very pleasant features of Pacemaker is its ability to check a resource's status; it uses the monitor operation for this purpose. When the resource agent is called with the monitor target, it needs to check whether the program is running as defined in the resource configuration. This typically means checking multiple factors. In Asterisk's case, the monitor operation is implemented as two functions in the OCF agent. The asterisk_status function checks if a process exists that matches the entry in Asterisk's PID file and whether the astcanary process belonging to it also exists. asterisk_monitor also finds out if the Asterisk process responds to a query via the Asterisk console and checks whether a test request using sipsak works.
Astcanary – A Special Case
Asterisk gives programmers insights into how special cases are handled in status queries. If Asterisk is running in real-time mode, astcanary needs to run in addition to the asterisk process. Every second, astcanary updates the timestamp of a file with the slightly weird name of alt.asterisk.canary.tweet.tweet.tweet, which resides in the Asterisk run directory. Asterisk runs in real-time mode while the timestamp update occurs. If the timestamp update stops, Asterisk reverts to normal mode. If you set the realtime parameter to yes, you need to check the status of astcanary in addition to Asterisk: If Astcanary fails, Asterisk stops running as configured, and the resource agent has to return an error when it checks the status.
Asterisk is also a good example of how to handle programs that don't let you specify the location of their status files (PID file, socket for connections by external programs) as a parameter to the binary. Asterisk simply doesn't have a parameter for this situation. The file containing the PID of an active Asterisk instance is always named asterisk.pid and resides in the Asterisk run directory. The Asterisk configuration file tells you where the run directory is; the variable for this is astrundir. It is advisable to first define a variable that contains the value of astrundir for the Asterisk agent. The definition of the variable occurs at the end of the resource agent and might look like:
ASTRUNDIR=`grep astrundir $OCF_RESKEY_config | awk '/^astrundir/ {print $3}'`
It is a good idea to store the value of astlogdir in a separate variable, since it is needed when the program starts. The definition is similar to that of astrundir:
ASTLOGDIR=`grep astlogdir $OCF_RESKEY_config | awk '/^astlogdir/ {print $3}'`
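The two grep and awk pipelines can be folded into one small function. The sketch below is an alternative under stated assumptions, not the agent's actual code: ast_conf_value is a hypothetical helper, and it assumes the "key => value" layout that the original pipelines also rely on when they print the third field.

```shell
#!/bin/sh
# ast_conf_value KEY FILE: hypothetical helper, not part of the agent.
# Prints the value of the first uncommented "KEY => VALUE" line in FILE.
ast_conf_value() {
    awk -v key="$1" '$1 == key && $2 == "=>" { print $3; exit }' "$2"
}

# Usage sketch; the fallback path is the default from Listing 2
conf="${OCF_RESKEY_config:-/etc/asterisk/asterisk.conf}"
if [ -r "$conf" ]; then
    ASTRUNDIR=`ast_conf_value astrundir "$conf"`
    ASTLOGDIR=`ast_conf_value astlogdir "$conf"`
fi
```

Comparing the first field exactly means that a commented-out line such as ";astrundir => ..." is skipped, and the exit in the awk program stops at the first match, so the variable never ends up holding several paths.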
Status Operation
Listing 7 shows a complete example of a status operation. The function checks whether a PID file exists and whether the PID it mentions really does exist on the system. If this is the case, the function uses OCF_RESKEY_realtime to check whether real-time mode is enabled. If so, it uses pgrep to check whether an astcanary process exists for the Asterisk instance. (The alt.asterisk.canary.tweet.tweet.tweet file always resides in the Asterisk rundir.) If all of these tests return positive results, the function confirms with OCF_SUCCESS. If Asterisk is not running, the function returns OCF_NOT_RUNNING and removes the stale PID file, if it exists. If Asterisk is running, but Astcanary isn't running despite realtime mode, an error is returned with a value of OCF_ERR_GENERIC. This value tells Pacemaker to restart the resource. For an overview of all OCF return values, see the "OCF Return Values" box.
Listing 7: asterisk_status Function
asterisk_status() {
    local pid
    local rc
    local astcanary_pid

    if [ ! -f $ASTRUNDIR/asterisk.pid ]; then
        ocf_log info "Asterisk PBX is not running"
        return $OCF_NOT_RUNNING
    fi

    pid=`cat $ASTRUNDIR/asterisk.pid`
    ocf_run kill -s 0 $pid
    rc=$?

    if [ $rc -eq 0 ]; then
        if ocf_is_true "$OCF_RESKEY_realtime"; then
            astcanary_pid=`pgrep -d " " -f "astcanary $ASTRUNDIR/alt.asterisk.canary.tweet.tweet.tweet"`
            if [ ! "$astcanary_pid" ]; then
                ocf_log err "Asterisk PBX is running but astcanary is not although it should"
                return $OCF_ERR_GENERIC
            fi
        fi
        # Asterisk is running (and so is astcanary in realtime mode)
        return $OCF_SUCCESS
    else
        ocf_log info "Asterisk PBX not running: removing old PID file"
        rm -f $ASTRUNDIR/asterisk.pid
        return $OCF_NOT_RUNNING
    fi
}
Incidentally, the function uses the OCF ocf_run function to call binaries. Compared with a direct call, this approach offers the option of logging all of the output of the command that ocf_run executes. A direct call would send this output to a black hole, and possible error messages would disappear along with it. If you need to execute an external program, ocf_run is very much recommended. This example uses the OCF ocf_is_true function to query the value of realtime by reference to the OCF_RESKEY_realtime environment variable. This approach lets administrators enter either realtime=true or realtime=yes (or the negative counterparts) in the CRM shell. The ocf_is_true function detects all of these values and behaves accordingly. Using an if clause to check for the individual values would result in far more code.
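To see why a single truth test is shorter than a chain of if clauses, consider the following simplified stand-in. The is_true function is an illustration only; the real ocf_is_true is provided by ocf-shellfuncs and may accept a different set of spellings.

```shell
#!/bin/sh
# is_true VALUE: simplified stand-in for illustration only; the real
# ocf_is_true comes from ocf-shellfuncs and may accept other spellings.
is_true() {
    case "$1" in
        yes|YES|true|TRUE|True|1|on|ON) return 0 ;;
        *)                              return 1 ;;
    esac
}

# Try a few values an administrator might enter for realtime
for value in true yes 1 no false ""; do
    if is_true "$value"; then
        echo "'$value' enables realtime mode"
    else
        echo "'$value' leaves realtime mode off"
    fi
done
```

Anything that is not recognized, including an empty value, counts as false in this sketch, which is the safer default for an option such as realtime.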
Monitor Operation
Unlike the Status operation, the Monitor operation doesn't stop at checking whether processes exist; it also checks whether Asterisk is responding to external requests. For other programs, you could use telnet or openssl s_client, for example, to test whether a response is returned for a specific port. The Asterisk binary has a client mode, which you can start with the -r option (Listing 9 adds -x to run a single command).
A Monitor operation for Asterisk might look like Listing 8. It is no coincidence that the asterisk_monitor function initially calls asterisk_status and immediately returns that function's error code if the return value is not OCF_SUCCESS. Based on this principle, the agent later only needs to call the asterisk_monitor function when called with the monitor target. The asterisk_rx function is a helper function defined at the start of the agent (Listing 9).
Listing 8: Monitor Function
asterisk_monitor() {
    local rc

    asterisk_status
    rc=$?

    # If status returned an error, return that immediately
    if [ $rc -ne $OCF_SUCCESS ]; then
        return $rc
    fi

    # Check whether connecting to asterisk is possible
    asterisk_rx 'core show channels count'
    rc=$?

    if [ $rc -ne 0 ]; then
        ocf_log err "Failed to connect to the Asterisk PBX"
        return $OCF_ERR_GENERIC
    fi

    # Optionally check the monitor URI with sipsak
    # The return values:
    # 0 means that a 200 was received.
    # 1 means something else than 1xx or 2xx was received.
    # 2 will be returned on local errors like non resolvable names
    #   or wrong options combination.
    # 3 will be returned on remote errors like socket errors
    #   (e.g. icmp error), redirects without a contact header or
    #   simply no answer (timeout).
    #   This can also happen if sipsak is run too early after asterisk
    #   start.
    if [ -n "$OCF_RESKEY_monitor_sipuri" ]; then
        ocf_run sipsak -s "$OCF_RESKEY_monitor_sipuri"
        rc=$?
        case "$rc" in
            1|2) return $OCF_ERR_GENERIC;;
            3)   return $OCF_NOT_RUNNING;;
        esac
    fi

    ocf_log debug "Asterisk PBX monitor succeeded"
    return $OCF_SUCCESS
}
Listing 9: asterisk_rx Helper Function
asterisk_rx() {
    # if $HOME is set, asterisk -rx writes a .asterisk_history there
    (
        unset HOME
        ocf_run $OCF_RESKEY_binary -r -s $ASTRUNDIR/asterisk.ctl -x "$1"
    )
}
The Monitor function in Listing 8 first checks if Asterisk is running (by calling asterisk_status()) and then tells the Asterisk client to issue the core show channels count command at the Asterisk console. If the administrator has entered a SIP URL as the monitor_sipuri parameter, the agent attempts to open a SIP connection to this URL using sipsak. Depending on the success of the individual commands, the function returns 0 or an error code.
Start Function
The start function is primarily designed to make sure the Asterisk daemon runs as intended by the CRM configuration. However, it has to do more than this. The OCF standard requires the start function to return OCF_SUCCESS, that is 0, in cases where Asterisk is already running. It wouldn't make any sense to start the daemon again in this case, because it would very likely quit with an error. asterisk_monitor provides a powerful tool for testing whether Asterisk is working: start would ideally call asterisk_monitor and interpret the return value accordingly before the function does anything else.
If Asterisk is not running, the agent needs to start the service. Astcanary again plays an important role: It is launched by Asterisk itself so that the Asterisk agent doesn't need to worry about it. But if the Asterisk instance dies, the matching Astcanary will not automatically follow suit. This explains why the start function also has to remove any leftover Astcanary processes before starting a new Asterisk.
Because agents that comply with the OCF standard must not be distribution-specific, tools such as the Debian start-stop-daemon are not permissible. The agent has to call the Asterisk binary directly. To do so, it constructs an Asterisk call from the various defined parameters and then uses ocf_run to execute it. Before doing so, the agent ensures that the folders Asterisk needs to work actually exist and that the Asterisk user defined in the parameters has write privileges for these folders. Immediately after calling the binary, the start function runs monitor again to check if the start really worked and to ensure that Asterisk is running as intended. Apart from the Astcanary complications, the start function is typical, and you can use it with many other daemons with minor modifications. The entire start function looks like Listing 10.
Listing 10: Start Function
asterisk_start() {
    local asterisk_extra_params
    local dir
    local rc

    asterisk_status
    rc=$?
    if [ $rc -eq $OCF_SUCCESS ]; then
        ocf_log info "Asterisk PBX already running"
        return $OCF_SUCCESS
    fi

    # If Asterisk is not already running, make sure there is no
    # old astcanary instance when the new asterisk starts. To
    # achieve this, kill old astcanary instances belonging to
    # this $ASTRUNDIR.

    # Find out PIDs of running astcanaries
    astcanary_pid=`pgrep -d " " -f "astcanary $ASTRUNDIR/alt.asterisk.canary.tweet.tweet.tweet"`

    # If there are astcanaries running that belong to $ASTRUNDIR,
    # kill them one PID at a time.
    if [ "$astcanary_pid" ]; then
        for i in $astcanary_pid; do ocf_run kill -s KILL $i; done
    fi

    for dir in $ASTRUNDIR $ASTLOGDIR $ASTLOGDIR/cdr-csv $ASTLOGDIR/cdr-custom; do
        if [ ! -d "$dir" ]; then
            ocf_run install -d -o $OCF_RESKEY_user -g $OCF_RESKEY_group \
                $dir || exit $OCF_ERR_GENERIC
        fi
        # Regardless of whether we just created the directory or it
        # already existed, check whether it is writable by the
        # configured user
        if ! su -s /bin/sh - $OCF_RESKEY_user -c "test -w $dir"; then
            ocf_log err "Directory $dir is not writable by $OCF_RESKEY_user"
            exit $OCF_ERR_PERM
        fi
    done

    # set MAXFILES
    ulimit -n $OCF_RESKEY_maxfiles

    # Determine whether Asterisk PBX is supposed to run in Realtime
    # mode or not and make asterisk daemonize automatically
    if ocf_is_true "$OCF_RESKEY_realtime"; then
        asterisk_extra_params="-F -p"
    else
        asterisk_extra_params="-F"
    fi

    ocf_run ${OCF_RESKEY_binary} -G $OCF_RESKEY_group \
        -U $OCF_RESKEY_user -C $OCF_RESKEY_config \
        $OCF_RESKEY_additional_parameters \
        $asterisk_extra_params
    rc=$?
    if [ $rc -ne 0 ]; then
        ocf_log err "Asterisk PBX start command failed: $rc"
        exit $OCF_ERR_GENERIC
    fi

    # Spin waiting for the server to come up.
    # Let the CRM/LRM time us out if required
    while true; do
        asterisk_monitor
        rc=$?
        [ $rc -eq $OCF_SUCCESS ] && break
        if [ $rc -ne $OCF_NOT_RUNNING ]; then
            ocf_log err "Asterisk PBX start failed"
            exit $OCF_ERR_GENERIC
        fi
        sleep 2
    done

    ocf_log info "Asterisk PBX started"
    return $OCF_SUCCESS
}
The Stop Function
The whole meaning of a stop function is to make sure that Asterisk definitely isn't running after the call to the agent. The function begins with a status check and returns OCF_SUCCESS right away if Asterisk is already switched off. Following this, the function escalates in stages: First, the core stop now command is passed to Asterisk through the Asterisk console. This should do the job under normal circumstances. If not, the agent sends SIGTERM to the process.
If this command doesn't return a 0, it could mean that the kill command was unable to perform its task and that the daemon never received the signal. The agent then exits at this point, issuing an OCF_ERR_GENERIC, because it is very likely that the system is experiencing some major problems. If kill -s TERM doesn't return an error, this doesn't mean the process has died. The agent therefore keeps checking the status until shortly before the timeout for the stop operation is reached and issues a SIGKILL if the process is still there. If the daemon is still running at this point, there is no alternative to manual intervention by the administrator. The asterisk_stop function would look very much like Listing 11.
Listing 11: asterisk_stop Function
asterisk_stop() {
    local pid
    local astcanary_pid
    local rc

    asterisk_status
    rc=$?
    if [ $rc -eq $OCF_NOT_RUNNING ]; then
        ocf_log info "Asterisk PBX already stopped"
        return $OCF_SUCCESS
    fi

    # do a "soft shutdown" via the asterisk command line first
    asterisk_rx 'core stop now'

    asterisk_status
    rc=$?
    if [ $rc -eq $OCF_NOT_RUNNING ]; then
        ocf_log info "Asterisk PBX stopped"
        return $OCF_SUCCESS
    fi

    # If "core stop now" didn't succeed, try SIGTERM
    pid=`cat $ASTRUNDIR/asterisk.pid`
    ocf_run kill -s TERM $pid
    rc=$?
    if [ $rc -ne 0 ]; then
        ocf_log err "Asterisk PBX couldn't be stopped"
        exit $OCF_ERR_GENERIC
    fi

    # stop waiting
    shutdown_timeout=15
    if [ -n "$OCF_RESKEY_CRM_meta_timeout" ]; then
        shutdown_timeout=$((($OCF_RESKEY_CRM_meta_timeout/1000)-5))
    fi
    count=0
    while [ $count -lt $shutdown_timeout ]; do
        asterisk_status
        rc=$?
        if [ $rc -eq $OCF_NOT_RUNNING ]; then
            break
        fi
        count=`expr $count + 1`
        sleep 1
        ocf_log debug "Asterisk PBX still hasn't stopped yet. Waiting ..."
    done

    asterisk_status
    rc=$?
    if [ $rc -ne $OCF_NOT_RUNNING ]; then
        # SIGTERM didn't help either, try SIGKILL
        ocf_log info "Asterisk PBX failed to stop after ${shutdown_timeout}s using SIGTERM. Trying SIGKILL ..."
        ocf_run kill -s KILL $pid
    fi

    # After killing asterisk, stop astcanary (one PID at a time)
    if ocf_is_true "$OCF_RESKEY_realtime"; then
        astcanary_pid=`pgrep -d " " -f "astcanary $ASTRUNDIR/alt.asterisk.canary.tweet.tweet.tweet"`
        if [ "$astcanary_pid" ]; then
            for i in $astcanary_pid; do ocf_run kill -s KILL $i; done
        fi
    fi

    ocf_log info "Asterisk PBX stopped"
    return $OCF_SUCCESS
}
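The waiting period between SIGTERM and SIGKILL in Listing 11 is derived from OCF_RESKEY_CRM_meta_timeout, which holds the operation timeout in milliseconds. The arithmetic can be sketched as a separate function. sigterm_grace_period is a hypothetical name, and the lower bound of one second is an assumption added here; Listing 11 has no such floor.

```shell
#!/bin/sh
# sigterm_grace_period [TIMEOUT_MS]: hypothetical helper, not part of the agent.
# Prints how many seconds to wait after SIGTERM before sending SIGKILL.
sigterm_grace_period() {
    timeout_ms=$1
    if [ -z "$timeout_ms" ]; then
        # No timeout known: same fallback as in Listing 11
        echo 15
        return
    fi
    # Milliseconds to seconds, minus a 5-second reserve for SIGKILL
    # and the astcanary cleanup
    grace=$(( $timeout_ms / 1000 - 5 ))
    # Assumption: never wait less than one second
    if [ "$grace" -lt 1 ]; then
        grace=1
    fi
    echo "$grace"
}

sigterm_grace_period 20000    # stop timeout of 20s -> prints 15
sigterm_grace_period          # no timeout set      -> prints 15
sigterm_grace_period 3000     # very short timeout  -> prints 1
```

With the default stop timeout of 20 seconds from the metadata, the agent therefore waits up to 15 seconds for SIGTERM to take effect and keeps 5 seconds in reserve for the rest of the function.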
The Resource Agent Body
Thus far, I have defined functions that handle the agent's various tasks. What I still don't have is the part that actually calls these functions when the agent is launched against a specific target.
This part of the program needs to store the definitions for ASTRUNDIR and ASTLOGDIR that I mentioned earlier, or at least, these definitions need to come after the validate function call (without a successful validation, the agent cannot be sure that the configuration file is readable, so it might be unable to set them at all). The remainder mainly reflects the syntax used by many init scripts. Listing 12 contains a full-fledged example, which would occur at the end of the agent file.
Listing 12: Agent Body
case "$1" in
    meta-data)  meta_data
                exit $OCF_SUCCESS;;
    usage|help) usage
                exit $OCF_SUCCESS;;
esac


# Anything except meta-data and help must pass validation
asterisk_validate || exit $?

# Now that validate has passed and we can be sure to be able to read
# the config file, set convenience variables
ASTRUNDIR=`grep astrundir $OCF_RESKEY_config | awk '/^astrundir/ {print $3}'`
ASTLOGDIR=`grep astlogdir $OCF_RESKEY_config | awk '/^astlogdir/ {print $3}'`

# What kind of method was invoked?
case "$1" in
    start)        asterisk_start;;
    stop)         asterisk_stop;;
    status)       asterisk_status;;
    monitor)      asterisk_monitor;;
    validate-all) ;;
    *)            usage
                  exit $OCF_ERR_UNIMPLEMENTED;;
esac
Testing the Resource Agent
It isn't absolutely necessary to install Pacemaker to test whether the agent is working. Because OCF agents are shell scripts, you can call them at the command line (Figure 3). The important thing is that the OCF_ROOT environment variable is set; on most Linux distributions, the variable needs to point to /usr/lib/ocf.
Additionally, the OCF-specific environment variables need to be set for the parameters in the shell; for example, OCF_RESKEY_realtime. This is especially important if a parameter is tagged as required or if no default value exists for it. Then, call the agent, which is named asterisk in this example, by typing ./asterisk monitor to check the status of the resource. (The agent resides in /usr/lib/ocf/resource.d/heartbeat.) It always makes sense to have a working installation of Asterisk or whatever program the agent was written for. start, stop, and all other commands work similarly. If the agent does not work as intended, a call with a prefix of sh -x will probably provide more insights.
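When you call the agent by hand, its verdict arrives as a numeric exit status. The following sketch translates the standard OCF exit codes into their symbolic names; ocf_rc_name is a hypothetical helper for interactive testing and not part of ocf-shellfuncs.

```shell
#!/bin/sh
# ocf_rc_name CODE: hypothetical helper, not part of ocf-shellfuncs.
# Translates a numeric OCF exit status into its symbolic name.
ocf_rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        8) echo OCF_RUNNING_MASTER ;;   # only relevant for master/slave resources
        9) echo OCF_FAILED_MASTER ;;    # only relevant for master/slave resources
        *) echo "unknown ($1)" ;;
    esac
}

# Typical manual test run, left as comments because it needs the agent
# and a local Asterisk installation:
#   export OCF_ROOT=/usr/lib/ocf
#   export OCF_RESKEY_realtime=true
#   ./asterisk monitor; ocf_rc_name $?
ocf_rc_name 7    # prints: OCF_NOT_RUNNING
```

A monitor call against a stopped Asterisk should report OCF_NOT_RUNNING rather than OCF_ERR_GENERIC; Pacemaker treats the two cases differently, so it is worth checking both by hand.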
If you have a working Pacemaker, you can also define a resource with the agent (in this example, the agent is named ocf:heartbeat:asterisk) and then test it with ocf-tester (Figure 4). Assuming the Asterisk resource is named p_asterisk, the full command for the call to ocf-tester would be:
ocf-tester -n p_asterisk /usr/lib/ocf/resource.d/heartbeat/asterisk
If ocf-tester outputs an error, you know that something is wrong with your work.
Conclusions
The OCF standard is a very powerful tool for creating high-quality resource agents for Pacemaker. Because the standard is based on shell scripting, administrators with some scripting skills should find it fairly easy to build an agent.
This article used Asterisk as an example application; however, most of the standard functions are similar for any agent. If you have created a resource agent and would like to make it part of the official cluster-agents package, you can post your code on the Linux-HA Dev mailing list [4] as a basis for discussion. Look online for the complete Asterisk resource agent implementation [5].