Introduction to behavior-driven monitoring
Fresh Approach
Any system administrator will be familiar with the situation: You monitor your system comprehensively, all of the services appear to work, and yet you see some anomalies in the graph for payment transactions. The access times for the payment systems have risen by a factor of 10. Manual testing of the components doesn't show any issues, and testing the metrics of the payment interface to the external service provider doesn't report any errors either.
After some intensive research, you finally find the evildoer: A bug in the middleware, which was updated last night, is preventing trouble-free fulfillment of payments. You roll back the release, and performance problems affecting payments are a thing of the past.
Agile Methods
To prevent incidents like this from becoming the norm, software developers often work with agile methods that have a considerable effect on the development process.
With this approach, the focus of development is no longer exclusively on programming activities but on comprehensive testing of the software as well as close cooperation with all of the project stakeholders. These working principles form the basis of "Test-Driven Development" and "Behavior-Driven Development."
The main idea behind test-driven development is to ensure that software doesn't just work but also behaves exactly the way the user expects it to work. To achieve this, automated test systems such as Jenkins [1] (formerly Hudson) or CruiseControl [2] are deployed to test specific features of the application regularly.
Behavior-driven development extends the principles of test-driven development by allowing non-programmers to participate in the process of software development. The focus is on the involvement of all project stakeholders.
To give the project stakeholders the best possible visibility into how the application works, stakeholders additionally must be able to contribute their own scenarios for automated testing.
This behavior-driven approach relies on a domain-specific language (DSL) that is modeled on natural language and should not be viewed as a programming language.
Cucumber-Nagios [3] applies both of these principles and supports their application to the infrastructure. This means not just monitoring from an object's point of view (the application is running) but also considering the process that leads to the desired results (the application is performing as desired).
At the same time, all members of the team can contribute their own checks to the monitoring system without the administrator having to write special functions. There is a buzzword for this, too: "Behavior-Driven Infrastructure."
Taking Stock
Cucumber-Nagios can be integrated as a plugin into any open source monitoring system that can interpret the values returned by Nagios plugins.
This tool is based on Cucumber [4] – a popular piece of software for automated testing of Ruby, Java, .NET, Flex, and web applications – and it uses the Gherkin DSL [5].
Down in the engine room, you will also find Webrat, a low-resource browser simulator used for test scenarios involving HTTP. You will also find the Mechanize library [6], which contributes functions for automated interactions on websites. Additionally, thanks to the Ruby Net::SSH [7] implementation of the SSH2 client protocol, SSH support is also on board.
Terminology
Cucumber-Nagios terminology is comparatively simple and consists of projects, features, and steps. A project contains the complete structure that is required to write and execute test scenarios. For example, you can create one project for performing HTTP checks against a website and offload DNS checks into another, separate project. However, you could just as easily keep all of the services you want to check in a single project.
A project is made up of features and steps. The features contained in the test scenarios are written in the Gherkin DSL, whereas the steps are short blocks of code written in Ruby. These blocks contain the application logic for logging into a web interface or opening a connection to an external server.
To keep things simple, many steps are integrated into Cucumber-Nagios, which lets administrators focus directly on the services they need to monitor. These prebuilt steps include checks of, and checks that rely on, DNS, HTTP, ICMP, and SSH. Steps for simple file operations, executing commands on external hosts, and monitoring AMQP protocol-based applications are also included by default.
If necessary, you also can define custom steps for any defined feature to cope with the complexity of your own environment.
An Initial Project
Before you can install Cucumber-Nagios, you need Ruby 1.8.7 and Ruby's own package manager RubyGems (1.3.5 or newer) in place. Once Ruby and RubyGems are ready for action, you can install Cucumber-Nagios and all its dependencies by becoming root and executing:
gem install cucumber-nagios
Like many other applications written in Ruby, Cucumber-Nagios includes its own generator for the most frequently used functions. To create your own initial project, you can just type:
cucumber-nagios-gen project checks
Then, you can execute gem bundle in the project directory. Doing so bundles the dependencies into the project, which lets you copy the project folder to any server without needing to install Cucumber-Nagios and its dependencies there.
Defining Features
You can use Cucumber-Nagios's own generator again to create the basic framework for new features. To do this, type
cucumber-nagios-gen feature www.xing.com startpage
in the project directory; this step creates two new files (Listing 1).
Listing 1: New Features of Integrated Generator
cucumber-nagios-gen feature www.xing.com startpage
Generating with feature generator:
[ADDED] features/www.xing.com/startpage.feature
[ADDED] features/www.xing.com/steps/startpage_steps.rb
The features/www.xing.com/startpage.feature file contains the test scenario. You can add your own steps to features/www.xing.com/steps/startpage_steps.rb as needed.
The generated startpage.feature file is a rudimentary example that you can modify as needed. The syntax used is similar to a natural language, and it is easily mastered without knowledge of a programming language. The program logic is initially hidden in the steps.
All features share the same opening structure. The first two lines give the feature name and a short, intuitive description of the status to expect. The description can take up multiple lines.
In Listing 2, the feature stipulates that http://www.xing.com must be available. The Scenario: keyword then introduces the test scenario; the lines that follow are instructions for the actions that the service check must execute and evaluate.
Listing 2: Availability of www.xing.com
Feature: www.xing.com
  It should be reachable

  Scenario: Visiting the website
    When I go to http://www.xing.com/
    Then the request should succeed
Listing 2 contains two instructions. The When I go to http://www.xing.com/ line defines the initial action, and the Then the request should succeed line describes the anticipated result.
These two pieces of information are all it takes to build a working check for the availability of http://www.xing.com. Putting the And keyword at the start of a new line allows you to add further conditions for a successful check. The But keyword is used if you want to define exclusion criteria, which makes sense if you need to verify the content of a page. So, in the context of Listing 2, the following two conditions would make sense if you want to ensure that you really are on the Xing welcome page: And I should see "Join XING for free" and But I should not see "Welcome to Facebook".
Given, When, Then, And, and But make up the complete list of supported keywords. Note that Cucumber does not distinguish between the instructions from a technical point of view; however, sticking to the existing conventions is useful.
The Given keyword is only used if a defined status is assumed, as in:
Given I am on http://www.xing.com/
In contrast, When typically describes an action to be executed, such as interacting with a website or checking for the existence of a specific file. The following phrase would be conceivable in the context of the previous example:
When I fill in "username-field" with "username"
The And keyword stands in for a further Given, When, or Then instruction and simply keeps the scenario more readable. I could add the following instructions to the example:
And I fill in "password-field" with "password"
And I press "login-button"
Then is used to check the results. In this case, I am interested in whether or not the login works:
Then I should see "What's new in your network"
The But keyword is very similar to And but defines exclusion criteria:
But I should not see "Join XING for free"
Listing 3 shows a feature for checking a login to http://www.xing.com/.
Listing 3: Login Check
# ./features/www.xing.com/login.feature
Feature: www.xing.com
  I should be able to login on http://www.xing.com/

  Scenario: Logging in
    Given I am on http://www.xing.com/
    When I fill in "username-field" with "username"
    And I fill in "password-field" with "password"
    And I press "login-button"
    Then I should see "What's new in your network"
    But I should not see "Join XING for free"
In many scenarios, the concern is not only whether something works but how long a request takes to process. Websites are expected to provide the best possible performance. Performance monitoring with Cucumber-Nagios is possible with the predefined Given I am benchmarking step. Here is an example:
Scenario: Benchmarking home page
  Given I am benchmarking
  When I go to http://www.xing.com/
  Then the request should succeed
  And the elapsed time should be less than 1 seconds
In this case, two elements are checked: whether the HTTP status code indicates success and whether the request run time is less than one second. If one of these conditions is not fulfilled, the service is classified as critical, and an alert might be issued depending on the monitoring system configuration.
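The timing half of this check can be pictured in plain Ruby. The following is only a minimal sketch, not the code behind the prebuilt benchmarking step: the helper name elapsed_within? is hypothetical, and a short sleep stands in for the HTTP request. It measures a block with Benchmark.realtime from the standard library and compares the result with the limit.

```ruby
require 'benchmark'

# Hypothetical helper: runs the block, measures the wall-clock time,
# and reports whether the block finished in under `limit` seconds.
def elapsed_within?(limit)
  elapsed = Benchmark.realtime { yield }
  elapsed < limit
end

# A short sleep stands in for the request; a real check would fetch the page here.
fast_enough = elapsed_within?(1) { sleep 0.05 }
puts fast_enough ? 'OK' : 'CRITICAL'
```

Measuring around the whole block means that connection setup and transfer time both count toward the limit, which is usually what a user-facing performance check should capture.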
Steps in Detail
The hard work is done by the prebuilt steps in the background. These steps are short blocks of Ruby code that provide the functionality for the checks. You can use a recursive grep in the project to create an overview of the steps organized by method. These predefined steps are a solid basis on which you can build your own:
grep -R '[WT]hen\|Given' features/steps/| awk -F \: '{print $2}'|sort -d
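If you would rather stay in Ruby, roughly the same overview can be produced with a few lines of code. This is a sketch under assumptions: the helper name step_patterns is made up for illustration, and, unlike the grep call, it only picks up lines that begin with Given, When, or Then.

```ruby
# Hypothetical helper: collects the step definition lines from all
# *.rb files below `dir` and returns them sorted.
def step_patterns(dir)
  Dir.glob(File.join(dir, '**', '*.rb')).flat_map do |file|
    File.readlines(file).grep(/^\s*(Given|When|Then)\b/).map(&:strip)
  end.sort
end

# Prints nothing useful unless run from inside a project directory.
puts step_patterns('features/steps')
```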
The features/steps/ping_steps.rb file (Listing 4) implements a simple ping test. A feature that pings www.xing.com via ICMP would look like this:
Feature: www.xing.com
  It should respond

  Scenario: Ping test
    When I ping www.xing.com
    Then it should respond
Listing 4: features/steps/ping_steps.rb
01 When /^I ping (.*)$/ do |host|
02   @result = system("ping -c1 #{host} > /dev/null 2>&1")
03 end
04
05 Then /^it should respond$/ do
06   @result.should be_true
07 end
08
09 Then /^it should not respond$/ do
10   @result.should be_false
11 end
Each step definition is introduced by one of the methods Given, When, Then, And, or But. A regular expression follows that matches the instruction used in the feature. The do keyword opens the block containing the logic, and |host| is the block parameter that receives the host name captured by the group in the regular expression.
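How a captured value travels from the feature text into the block can be illustrated with a toy step registry. This is a deliberately simplified sketch and not how Cucumber is implemented internally; register_step, run_step, and the STEPS hash are names invented for this example.

```ruby
# Toy step registry for illustration; Cucumber's real implementation differs.
STEPS = {}

def register_step(pattern, &block)
  STEPS[pattern] = block
end

# All keyword methods store the pattern in the same way.
def Given(pattern, &block); register_step(pattern, &block); end
def When(pattern, &block);  register_step(pattern, &block); end
def Then(pattern, &block);  register_step(pattern, &block); end

# Drops the leading keyword, finds the first matching pattern, and
# calls its block with the captured groups as arguments.
def run_step(line)
  text = line.sub(/\A(Given|When|Then|And|But)\s+/, '')
  STEPS.each do |pattern, block|
    match = pattern.match(text)
    return block.call(*match.captures) if match
  end
  raise ArgumentError, "undefined step: #{line}"
end

When(/^I ping (.*)$/) { |host| "would ping #{host}" }

puts run_step('When I ping www.xing.com')
puts run_step('Given I ping www.xing.com')
```

In this sketch, both calls reach the same block, because the keyword is removed before the text is compared with the pattern; only the part matched by (.*) ends up in host.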
The important thing to note here is that the method name is not part of the regular expression. Whether the feature says When I ping or Given I ping makes no difference to the match, which leaves the author of the feature free to follow the "Given–When–Then" convention used in Cucumber [8].
Line 2 of Listing 4 finally does the work. It initializes the @result instance variable with the return value of the system call, which is true if the ping command exits successfully (host available) and false otherwise (host unreachable); the output of the command itself is discarded. The steps starting in line 5 then evaluate this value and let the check pass or fail accordingly.
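The value stored in @result is easy to verify outside of Cucumber. The short sketch below assumes a Unix-like system that provides the true and false commands; they stand in for a reachable and an unreachable host.

```ruby
# Both commands contain shell redirections, so Ruby runs them via the shell,
# just like the ping call in the step definition.
succeeded = system('true > /dev/null 2>&1')   # command exits with status 0
failed    = system('false > /dev/null 2>&1')  # command exits with status 1

puts succeeded.inspect
puts failed.inspect
```

Because system reports the exit status rather than the output of the command, the redirection to /dev/null does not affect the result.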
If you now extend the example and provide the author of the test scenario with the ability to define the number of pings, you'll need to modify the regular expression and add the corresponding variable, according to the following:
When /^I ping (.*) (.*) times$/ do |host, count|
  @result = system("ping -c #{count} #{host} > /dev/null 2>&1")
end
You can now define the number of pings to send in the matching features, as follows:
When I ping www.xing.com 3 times
Then it should respond
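Because the captured values are interpolated into a shell command, anyone who can write a feature can also inject arbitrary shell code through the host or count. If that is a concern in your environment, one possible hardening is to validate both values and build an argument list instead. The helper below is a sketch of that idea; ping_argv and its validation rules are assumptions and not part of Cucumber-Nagios.

```ruby
# Hypothetical helper: validates the captured values and builds an
# argument list, so nothing is interpolated into a shell string.
def ping_argv(host, count)
  raise ArgumentError, "invalid count: #{count}" unless count.to_s =~ /\A[1-9]\d*\z/
  raise ArgumentError, "invalid host: #{host}" unless host =~ /\A[A-Za-z0-9][A-Za-z0-9.-]*\z/
  ['ping', '-c', count.to_s, host]
end

argv = ping_argv('www.xing.com', '3')
p argv
# Inside a step, the list form of system bypasses the shell:
#   @result = system(*argv, out: File::NULL, err: File::NULL)
```

The trade-off is that the host pattern used here rejects anything that is not a plain host name or IPv4 address, so it would need to be relaxed for IPv6 targets.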
Nagios Integration
Integration with your own monitoring system is easily accomplished. As an example, I will be using the king of open source monitoring systems: Nagios.
As I mentioned previously, the basic requirement for executing the Cucumber-Nagios plugin is a working Ruby installation. If you evaluated Cucumber-Nagios on a server other than the Nagios server, you'll need to copy the project directory recursively to your Nagios server.
The path to the plugin directory depends on your distribution and the way in which you configured Nagios. Debian, Red Hat, and SUSE Linux install the Nagios plugins in /usr/lib/nagios/plugins.
However, if you built Nagios yourself, the default directory is /usr/local/nagios/libexec. You probably will want to create a cucumber-nagios subdirectory in the plugin directory and copy the project to this subdirectory. You will then want to modify the file and directory permissions to allow the Nagios user to access the files.
In many cases, the resource.cfg file sets a variable for the plugin directory, for example, $USER1$=/usr/lib/nagios/plugins. To keep the configuration overhead as lean as possible, and to avoid losing track, define your own variable for the path to the Cucumber-Nagios project, for example, $USER2$=/usr/lib/nagios/plugins/cucumber-nagios.
Now, you just need the command for checking the service in the commands.cfg file:
define command{
  command_name check_cn
  command_line $USER2$/checks/bin/cucumber-nagios $ARG1$
}
The command_name is freely selectable. For command_line, bear in mind that the path variable must be correct and that $ARG1$ is appended to the command itself. Exactly one argument is passed to Cucumber-Nagios when called. The service definition in services.cfg defines this argument (Listing 5).
Listing 5: services.cfg
define service{
  use local-service
  host_name localhost
  service_description CN - www.xing.com - startpage
  check_command check_cn!$USER2$/checks/features/www.xing.com/startpage.feature
}
Before you tell the Nagios daemon to reload its configuration, test the checks you defined at the command line. To do so, run the plugin in the Nagios user context:
~ $ su nagios -c "/usr/lib/nagios/plugins/cucumber-nagios/checks/bin/cucumber-nagios /usr/lib/nagios/plugins/cucumber-nagios/checks/features/www.xing.com/startpage.feature"
~ $ echo $?
If the check fails with a return value of 2, you have just been bitten by a bug; fortunately, it's easy to remedy. The --pretty option tells the tool to output the backtrace, which will help you troubleshoot the problem (Listing 6).
Listing 6: Troubleshooting with Backtrace
01 ~ $ [sudo] su nagios -c "/usr/lib/nagios/plugins/cucumber-nagios/checks/bin/cucumber-nagios /usr/lib/nagios/plugins/cucumber-nagios/checks/features/www.xing.com/startpage.feature --pretty";
02 Feature: www.xing.com
03   It should be up
04
05   Scenario: Visiting home page # /usr/lib/nagios/plugins/cucumber-nagios/checks/features/www.xing.com/startpage.feature:4
06     When I go to http://www.xing.com/ # steps/webrat_steps.rb:1
07       Permission denied - webrat.log (Errno::EACCES)
08       /usr/lib64/ruby/1.8/logger.rb:518:in `initialize'
09       /usr/lib64/ruby/1.8/logger.rb:518:in `open'
10       /usr/lib64/ruby/1.8/logger.rb:518:in `open_logfile'
11       /usr/lib64/ruby/1.8/logger.rb:487:in `initialize'
12       /usr/lib64/ruby/1.8/logger.rb:263:in `new'
13       /usr/lib64/ruby/1.8/logger.rb:263:in `initialize'
14       /usr/lib/nagios/plugins/cucumber-nagios/checks/features/steps/webrat_steps.rb:2:in `/^I go to (.*)$/'
15       /usr/lib/nagios/plugins/cucumber-nagios/checks/features/www.xing.com/startpage.feature:5:in `When I go to http://www.xing.com/'
16     Then the request should succeed # steps/result_steps.rb:13
17     And I should see "Join XING for free" # steps/result_steps.rb:1
18     And I should not see "Welcome to Facebook" # steps/result_steps.rb:5
19
20 Failing Scenarios:
21 cucumber /usr/lib/nagios/plugins/cucumber-nagios/checks/features/www.xing.com/startpage.feature:4 # Scenario: Visiting home page
22
23 1 scenario (1 failed)
24 4 steps (1 failed, 3 skipped)
25 0m0.008s
Line 7 of Listing 6 shows that the issue is caused by the file permissions: The Nagios user wants to write the webrat.log file but does not have the required permissions.
The explanation lies in the way the program was called. If you had called su with the -l option, the $HOME environment variable would have been set for the Nagios user, and the shell would have started in that user's home directory. Instead, the Nagios user is trying to write to the current working directory of the user who ran the su command. This problem is tricky if you run Cucumber-Nagios checks via SSH, because then the command doesn't run in an interactive shell.
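The underlying behavior can be reproduced with Ruby's standard Logger class alone: a relative logfile name is resolved against the current working directory of the process. The sketch below uses a temporary directory as the working directory; the helper name relative_log_path is invented for this example.

```ruby
require 'logger'
require 'tmpdir'

# Hypothetical helper: opens a logger with a relative file name while
# `working_dir` is the current directory and returns the resulting path.
def relative_log_path(working_dir)
  Dir.chdir(working_dir) do
    logger = Logger.new('webrat.log')
    logger.info('request logged')
    logger.close
    File.expand_path('webrat.log')
  end
end

Dir.mktmpdir do |dir|
  path = relative_log_path(dir)
  puts "#{path} exists: #{File.exist?(path)}"
end
```

If the working directory is not writable for the calling user, opening the logfile fails, which is the Errno::EACCES error shown in the backtrace.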
To get rid of the bug, just make a minor change to the code of the Cucumber-Nagios dependency, Webrat. To do so, open the vendor/gems/ruby/1.8/gems/webrat-0.7.0/lib/webrat/core/logging.rb file in the project directory and enter the full pathname of the logfile in line 18. Depending on the Nagios user's home directory, the line will look something like:
@logger ||= ::Logger.new("/var/nagios/webrat.log")
Finally, you should test the call to the plugin manually once more. The results will look far more satisfactory:
CUCUMBER OK - Critical: 0, Warning: 0, 4 okay | passed=4; failed=0; nosteps=0; total=4
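This status line follows the usual Nagios plugin convention: the text before the pipe character is meant for humans, and the key=value pairs after it are performance data. Plugins additionally signal their state through the exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). If you want to process the output in your own scripts, a small parser along the following lines might be enough; parse_status is a hypothetical helper and only handles integer values like the ones shown above.

```ruby
# Hypothetical helper: splits a plugin status line into the
# human-readable text and a hash of integer performance values.
def parse_status(line)
  text, perfdata = line.split('|', 2).map(&:strip)
  values = perfdata.to_s.scan(/(\w+)=(\d+)/).map { |key, value| [key, value.to_i] }.to_h
  { text: text, perfdata: values }
end

line = 'CUCUMBER OK - Critical: 0, Warning: 0, 4 okay | passed=4; failed=0; nosteps=0; total=4'
status = parse_status(line)
puts status[:text]
puts status[:perfdata].inspect
```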
Now, nothing should stand in the way of using Cucumber-Nagios to monitor functions. After reloading the daemon, Nagios will monitor the defined check. This completes the process of integration with the monitoring system. Figure 1 shows the results in a Nagios interface.
Conclusions
Cucumber-Nagios is without a doubt a powerful tool, but some experience is necessary to achieve fast results. Integration with an existing system is typically a slow process. Users will need to be introduced to the Cucumber and Gherkin idiom, and creating specific steps for your own platform is virtually impossible without knowledge of Ruby.
Environments in which the development and operation teams cooperate closely have a clear advantage here. Developers contribute the application logic, QA staff define test scenarios, and administrators focus on the operative side of their monitoring systems. But, this does assume collaboration between teams.
Apart from fixing minor bugs and working through a huge wish list [9], the Cucumber-Nagios project also could include more predefined steps.
Now it's up to the open source community to contribute. If this happens, Cucumber-Nagios definitely has the potential to become more widespread and could pave the way for a behavior-driven infrastructure.