
openlava – Hot resource manager

Share the Load

One way several users can share a high-performance system is through a software tool called a resource manager. Openlava is an open source version of a robust commercial scheduler; it is freely available, very scalable, and easy to install and customize. By Jeff Layton

One of the most critical pieces of software on a cluster is the resource manager, commonly called a job scheduler. It allows users to share the system in a very efficient and cost-effective way.

The idea is fairly simple: Users write small scripts, commonly called "jobs," that define what they want to run and the resources required, and then submit the jobs to the resource manager. When the resources are available, the resource manager executes the job script on behalf of the user. Typically, this approach is used for batch jobs – that is, jobs that are not interactive – but it can also be used for interactive jobs, in which case the resource manager gives you a shell prompt on the node that is running your job.

You have several options when it comes to resource managers. Some of them are commercially supported and others are open source, either with or without a support option. The list of options is fairly long, but the one I want to cover in this article is called openlava [1].

openlava

Platform Computing has had one of the most successful and arguably most robust commercially supported resource managers, named LSF, for many years. A while ago, Platform created a cluster management tool that is now called IBM Platform Cluster Manager (PCM; Platform has since been acquired by IBM). PCM is an open source package, and Platform wanted to integrate an open source resource manager with the tool. In 2007, Platform took an older version of LSF, version 4.2, and created an open source resource manager, which they named Platform Lava [2], or just "Lava." It is still included in PCM today.

Not too long ago, some developers based a new resource manager on Lava, calling it openlava. Openlava is being developed to go beyond Lava's advertised cluster size limit of 512 nodes while adding functionality. Openlava has a nice mailing list, and the developers make binaries available for major distributions such as Red Hat 5 and 6, CentOS 5 and 6, and openSUSE.

Although openlava is tremendously easy to install, you have to pay attention to a few gotchas. (In my first attempt, I didn't pay attention to these.) In the next sections, I'll walk through the installation of openlava on a cluster (the Warewulf-based cluster [3] I've written about before) and present a few configuration options that might help.

Installing openlava

The installation of openlava can proceed in several ways. You can install it on each node in the cluster, or you can install it in a shared filesystem. It works just fine in either case, but unlike many other resource managers, openlava works really well on a shared filesystem, so you can make your life a bit easier by going this route.

I installed openlava on a Scientific Linux 6.2 [4] system using the RHEL 6 binaries [5]. If you read the Quickstart guide [6] on the openlava site, you will learn that this binary is installed into /opt/openlava-2.0. For the Warewulf cluster I used in my testing, this works out very well because /opt is NFS-exported from the master node to the compute nodes.

The first step was to install openlava on the master node of the cluster. This is very easy to do using:

[root@test1 RPMS]# yum install \
   openlava-2.0-206.1.x86_64.rpm

Just as a reminder, the binary I used installs openlava into the /opt/openlava-2.0 directory, which, for the test cluster, is an NFS shared filesystem. Fortunately, the design of openlava means I don't need to install the binaries into the VNFS for my Warewulf cluster. All I really need to do is install the startup (init) scripts for openlava into the VNFS.

To make sure I had everything copied over to the VNFS I wrote a small script (Listing 1). Only three small files are copied into the VNFS; the rest are just symlinks within the VNFS. Note that the VNFS root directory is /var/chroots/sl6.2, so you might have to change the path to your VNFS (if you are using Warewulf). If you are not using Warewulf, you have several options. You can install the RPM into the image for each compute node or, if you have compute nodes that mount the shared filesystem, you will have to make sure the above scripts and symlinks are in the image for each compute node.

Listing 1: Installing init Scripts into VNFS

cp /etc/init.d/openlava /var/chroots/sl6.2/etc/init.d/openlava
cp /etc/profile.d/openlava.csh /var/chroots/sl6.2/etc/profile.d/openlava.csh
cp /etc/profile.d/openlava.sh /var/chroots/sl6.2/etc/profile.d/openlava.sh
cd /var/chroots/sl6.2/etc/rc.d/rc0.d
ln -s ../init.d/openlava K01openlava
cd /var/chroots/sl6.2/etc/rc.d/rc1.d
ln -s ../init.d/openlava K01openlava
cd /var/chroots/sl6.2/etc/rc.d/rc2.d
ln -s ../init.d/openlava S99openlava
cd /var/chroots/sl6.2/etc/rc.d/rc3.d
ln -s ../init.d/openlava S99openlava
cd /var/chroots/sl6.2/etc/rc.d/rc4.d
ln -s ../init.d/openlava S99openlava
cd /var/chroots/sl6.2/etc/rc.d/rc5.d
ln -s ../init.d/openlava S99openlava
cd /var/chroots/sl6.2/etc/rc.d/rc6.d
ln -s ../init.d/openlava K01openlava

After running this script, which you have to do as root, you just rebuild the VNFS with:

[root@test1 ~]# wwvnfs \
   --chroot /var/chroots/sl6.2

When it asks whether you want to overwrite the VNFS, answer y. Before you can configure openlava, you need to create an account for the openlava user. This isn't difficult to do – just use the useradd command:

[root@test1 ~]# useradd \
   -d /home/openlava -g openlava \
   -s /bin/bash openlava

I put the openlava home directory in /home and created a group called openlava.
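
Note that useradd -g openlava requires the group to exist, so create it before adding the user. That extra step is just:

[root@test1 ~]# groupadd openlava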

One very important point is that the openlava user has to exist on every system – the master node and all compute nodes that are running openlava. Adding the user on the master node created new entries in /etc/passwd and /etc/shadow, which are managed as "files" in Warewulf, so I need to tell Warewulf to update its version of these files. (Note: These files are automatically pushed to the compute nodes by Warewulf when they are booted.) This is easily done with the command:

[root@test1 ~]# wwsh \
   file sync

If you have problems getting openlava to work, be sure to check that user openlava exists on all nodes.
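
A quick way to verify this from the master node is to query a compute node directly (n0001 is my first compute node); the command should return the account's UID and GID rather than an error:

[root@test1 ~]# ssh n0001 id openlava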

One last thing to do before editing the openlava configuration files is to make sure the command . /etc/profile.d/openlava.sh (note the leading dot, which sources the script) is run by every user. To make things easy, I put the command in .bashrc because I'm running the Bash shell. This needs to happen for all users, including root.
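
One simple way to add the line is to append it to each user's .bashrc (run as that user, or adjust the path for each home directory):

[laytonjb@test1 ~]$ echo '. /etc/profile.d/openlava.sh' >> ~/.bashrc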

Next, you can edit the openlava configuration. All of the files you really need to worry about are in /opt/openlava-2.0/etc and are pretty easy to edit. In the configuration files, you need to include both the master node and all compute nodes, even if you don't want compute jobs to be run on the master node (this is easy to change when you want).

The first file to edit is /opt/openlava-2.0/etc/lsf.cluster.openlava, which defines which nodes are part of this openlava cluster. The changes you need to make are very simple, and Listing 2 is what I use for my system. The nodes that are in the openlava "cluster" are between the lines Begin Host and End Host. The first node I listed is test1, which is my master node, and the second node, n0001, is my first compute node. Note that I used the defaults after the node name.

Listing 2: Main Cluster Config File

[root@test1 etc]# more lsf.cluster.openlava
#-----------------------------------------------------------------------
# T H I S   I S   A    O N E   P E R   C L U S T E R    F I L E
#
# This is a sample cluster definition file.  There is a cluster
# definition file for each cluster.  This file's name should be
# lsf.cluster.<clustername>.
# See lsf.cluster(5) and the "LSF Administrator's Guide".
#
Begin   ClusterAdmins
Administrators = openlava
End    ClusterAdmins
Begin     Host
HOSTNAME                model          type  server  r1m  RESOURCES
test1                   !              !     1       -    -
n0001                   !              !     1       -    -
End     Host
Begin ResourceMap
RESOURCENAME  LOCATION
# tmp2             [default]
# nio             [all]
# console     [default]
End ResourceMap

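If your cluster has more compute nodes, each one gets its own line between Begin Host and End Host with the same defaults. For example, a hypothetical second and third compute node would be added as:

n0002                   !              !     1       -    -
n0003                   !              !     1       -    -
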
The second file I edited was /opt/openlava-2.0/etc/lsb.hosts. In this file, you can tell openlava about how many "slots" or cores your nodes have. For my cluster, the file looks like Listing 3.

Listing 3: lsb.hosts File

[root@test1 etc]# more lsb.hosts
#
# The section "host" is optional. If no hosts are listed here, all hosts
# known by LSF will be used by Batch. Otherwise only the hosts listed will
# be used by Batch.  The value of keyword HOST_NAME may be an official host
# name (see gethostbyname(3)), a host type/model name (see lsf.shared(5)), or
# the reserved word "default". The type/model name represents each of the
# hosts which are of that particular host type/model. The reserved
# word "default" represents all other hosts in the LSF cluster.
# MXJ is the maximum number of jobs which can run on the host at one time.
# JL/U is the maximum number of jobs belonging to a user that can run on the
# host at one time. The default is no limit.
# DISPATCH_WINDOW is the time windows when the host is available to run
# batch jobs.  The default dispatch window is always open.
# Other columns specify scheduling and stopping thresholds for LIM load
# indices. A "()" or "-" is used to specify the default value in a column
# and cannot be omitted.
# All the host names (except default) in this example are commented out,
# since they are just examples which may not be suitable for some sites.
# Don't use non-default thresholds unless job dispatch needs to be controlled.
Begin Host
HOST_NAME     MXJ JL/U   r1m    pg    ls     tmp  DISPATCH_WINDOW  # Keywords
test1         0   ()     ()     ()    ()     ()   ()
n0001         3   ()     ()     ()    ()     ()   ()
default       !   ()     ()     ()    ()     ()   ()               # Example
End Host
# Host groups can be referenced by the queue file.  Each line defines a host
# group.  The first line contains key words; each subsequent line contains a
# group name, followed by white space, followed by the list of group members.
# The list of members should be enclosed in parentheses and separated by
# white space.  This section is optional.
# This example is commented out
#Begin HostGroup
#GROUP_NAME    GROUP_MEMBER   # Key words
#group0       (host0 host1)   # Define a host group
#End HostGroup

To get started, just edit the lines between Begin Host and End Host. The line with the default host was already there, but I added the lines for the master node, test1, and the first compute node, n0001. The number in the MXJ column refers to the maximum number of jobs that can run on the host at one time. In my case, the compute node, n0001, has only three cores, so I gave a value of 3. I chose not to run any jobs on the master node, so I entered 0 under MXJ. If I wanted to run on the master node, I would have entered a value of 4 because I have four cores in my master node.
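
For example, if I later decide to let the master node run jobs on all four of its cores, the only change needed in lsb.hosts is the MXJ value on the test1 line:

test1         4   ()     ()     ()    ()     ()   ()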

Booting the Nodes

Booting openlava on both the master node and the compute node is not difficult, but a few commands can help make sure everything is working correctly. To begin, start openlava on the master node (test1). Because openlava doesn't start automatically when the master node boots, I used the service command to start it the first time:

[root@test1 ~]# service openlava start
Starting daemons...
lim started
res started
sbatchd started

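If you want openlava to come up automatically when the master node reboots, you can enable the init script that the RPM installed. Assuming the script carries the usual chkconfig header (if not, create runlevel symlinks by hand, as in Listing 1), this is just:

[root@test1 ~]# chkconfig --add openlava
[root@test1 ~]# chkconfig openlava on
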
I also recommend using the service openlava status command to make sure all of the daemons started correctly. You should see something like the following:

[root@test1 ~]# service openlava status
lim pid: <2707>
res pid: <2709>
sbatchd pid: <2711>
lim mbatchd: <2718>

The numbers in the brackets are the process IDs (PIDs) for each daemon. On the master node, you should see four PIDs, which in fact is the case.

You can read more about these daemons in the openlava documentation [7], but if you don't want to read an architecture document, here is a quick overview:

- lim – the Load Information Manager; it runs on every node, collects load and resource information (cores, memory, swap), and reports it to the master node's LIM.
- res – the Remote Execution Server; it runs on every node and executes tasks on that node on behalf of users.
- sbatchd – the slave batch daemon; it runs on every node and starts and monitors the batch jobs dispatched to that node.
- mbatchd – the master batch daemon; it runs only on the master node, manages the queues, and dispatches jobs to the sbatchd daemons.

The openlava architecture is very simple, and these four daemons interact in a very transparent way.

You should then use the lsid command to make sure the LIM daemon has started correctly:

[root@test1 ~]# lsid
openlava project 2.0, Jun 8 2012
My cluster name is openlava
My master name is test1

The output from this command gives you the openlava cluster name; the default is openlava. It also tells you the name of the master for the openlava cluster. In this case, it's test1, which makes sense because I have started it only on the master node.

I like to do a few more checks to make sure everything is working correctly. For the first check, I use lshosts (Listing 4). You can tell that monitoring (LIM) on the master node is working because it reports the number of cores (ncpus) correctly (4), and it reports the maximum memory and the maximum swap space (maxmem and maxswp, respectively). For a second check, I use bhosts, which lists the hosts in the openlava cluster (Listing 5).

Listing 4: lshosts Check

[root@test1 ~]# lshosts
HOST_NAME      type    model  cpuf ncpus maxmem maxswp server RESOURCES
test1       DEFAULT  DEFAULT   1.0     4  7363M 13999M    Yes ()
n0001       UNKNOWN UNKNOWN_   1.0     -      -      -    Yes ()

Listing 5: bhosts Check

[root@test1 ~]# bhosts
HOST_NAME    STATUS    JL/U   MAX  NJOBS    RUN  SSUSP  USUSP   RSV
n0001        unavail      -     3      0      0      0      0     0
test1        closed       -     0      0      0      0      0     0

You can tell that openlava is working up to this point because the master node (test1) reports closed, which is correct because I told openlava that it had no slots (CPUs) for running jobs. The output says that the first compute node, n0001, is unavail, which is correct because I haven't started it, so openlava daemons can't contact it. The next step is to boot the compute node. After it boots, you can log in and, as root, start up openlava (in fact, it might have started when it booted):

bash-4.1# service openlava start
Starting daemons...
lim started
res started
sbatchd started

The daemons started up correctly, so the next thing to do is use the command service openlava status to check the status of the daemons on the compute node:

bash-4.1# service openlava status
lim pid: <1667>
res pid: <1669>
sbatchd pid: <1671>
lim mbatchd: <>

Note that the compute node has no PID for mbatchd. This makes perfect sense because the compute node is not the master node, so it will not be running mbatchd.

Next, I'll check the status using lsid:

bash-4.1# lsid
openlava project 2.0, Jun 8 2012
My cluster name is openlava
ls_getmastername(): Cannot locate master LIM now, try later

When the compute node starts, it takes a while for all of its information to flow to the master node, which is why you see an error contacting the master node's LIM process. If you see this error, you should wait a few minutes to see whether the compute node LIM starts communicating with the master node's LIM. I only waited about 60 seconds and reran the command:

bash-4.1# lsid
openlava project 2.0, Jun 8 2012
My cluster name is openlava
My master name is test1

Once the compute node LIM and the master node LIM are communicating, you get a proper response instead of an error. The output from lsid shows the correct openlava cluster name (openlava) and the correct openlava master name (test1), so the LIMs are communicating correctly.
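
If you script the bring-up of compute nodes, you can wait for this handshake instead of rerunning lsid by hand. A minimal sketch (assuming the openlava profile script has been sourced so that lsid is in the PATH):

# Loop until the local LIM can reach the master LIM
until lsid 2>&1 | grep -q "My master name"
do
   sleep 10
done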

The next command to run on the compute node is lshosts (Listing 6). You can tell that the monitoring (LIM) on the compute node is working because it reports the number of cores (ncpus) correctly (3) and reports the maximum memory (maxmem). The compute node does not have any swap space, so the maxswp entry shows only a dash.

Listing 6: Check Host Resources

bash-4.1# lshosts
HOST_NAME      type    model  cpuf ncpus maxmem maxswp server RESOURCES
test1       DEFAULT  DEFAULT   1.0     4  7363M 13999M    Yes ()
n0001       UNKNOWN UNKNOWN_   1.0     3  2887M      -    Yes ()

The bhosts command goes one step further by listing the hosts in the openlava cluster (Listing 7). The output shows that the first compute node, n0001, is now ok, and the status of the openlava master node is closed (because I told it not to run any jobs).

Listing 7: List Hosts in Cluster

bash-4.1# bhosts
HOST_NAME     STATUS   JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
n0001         ok          -      3      0      0      0      0      0
test1         closed      -      0      0      0      0      0      0

It looks like everything is running correctly, but it's important to take a look at the openlava master node once the compute node is up and running. To do this, I will run the lshosts and bhosts commands on the openlava master node (Listing 8).

Listing 8: Checking the Master Node

[root@test1 ~]# lshosts
HOST_NAME      type    model  cpuf ncpus maxmem maxswp server RESOURCES
test1       DEFAULT  DEFAULT   1.0     4  7363M 13999M    Yes ()
n0001       UNKNOWN UNKNOWN_   1.0     3  2887M      -    Yes ()
[root@test1 ~]# bhosts
HOST_NAME    STATUS     JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
n0001        ok            -      3      0      0      0      0      0
test1        closed        -      0      0      0      0      0      0

Everything looks normal at this point. The lshosts command indicates that the compute node LIM is communicating with the openlava master node LIM because the number of CPUs (ncpus) is correct (3), and the maxmem number is correct (2887M). The output from the bhosts command is also correct because the status of the compute node, n0001, is ok, and the number of job slots, MAX, is correct (3).

Running Jobs

Openlava appears to be working correctly, but to be absolutely sure, I will throw some job scripts at it. The first simple script (test1.script) makes sure openlava is passing hostnames correctly (Listing 9). I won't go into the details of the script because the openlava site documents it, and LSF examples are available on the web; but, because I standardized on this list of openlava "commands," I will explain them briefly. Any line that starts with #BSUB is a directive for openlava. The directives I used are:

- -P – the project name associated with the job (test1 here)
- -n – the number of job slots (cores) requested
- -o – the file for the job's standard output
- -e – the file for the job's standard error
- -J – the name of the job

Listing 9: First Job Script

#!/bin/bash
#
# LSF test1
#
#BSUB -P test1                       # Project test1
#BSUB -n 2
#BSUB -o test1.out                   # output filename
#BSUB -e test1.err                   # error filename
#BSUB -J test1                       # job name
#
for h in `echo $LSB_HOSTS`
do
  echo "host name: $h"
done

I submitted the job to openlava with the command:

[laytonjb@test1 TEST_OPENLAVA]$ bsub \
   -R "type==any" < test1.script

The bsub command is what you use to submit the script to openlava. I used the option -R "type==any" because I have a compute node whose type differs from the master node; this tells openlava that it can use any node type – even ones it doesn't recognize – for running the job. Listing 10 shows the two files returned by openlava, test1.err and test1.out. (My apologies for the length, but I think it's important to see at least what the output files from openlava look like.)

Listing 10: Output from First Job Script

[laytonjb@test1 TEST_OPENLAVA]$ more test1.err
[laytonjb@test1 TEST_OPENLAVA]$ more test1.out
Sender: LSF System
Subject: Job 818: <test1> Done
Job <818> was submitted from host <test1> by user <laytonjb>.
Job was executed on host(s) <2*n0001>, in queue <normal>, as user <laytonjb>.
</home/laytonjb> was used as the home directory.
</home/laytonjb/TEST_OPENLAVA> was used as the working directory.
Started at Mon Sep 24 19:42:14 2012
Results reported at Mon Sep 24 19:42:16 2012
Your job looked like:
---------------------------------------------------------
# LSBATCH: User input
#!/bin/bash
#
# LSF test1
#
#BSUB -P test1                          # Project test1
#BSUB -n 2
#BSUB -o test1.out                      # output filename
#BSUB -e test1.err                      # error filename
#BSUB -J test1                          # job name
#
#
for h in `echo $LSB_HOSTS`
do
  echo "host name: $h"
done
#cat pgfile
---------------------------------------------------------
Successfully completed.
Resource usage summary:
    CPU time   :      0.03 sec.
The output (if any) follows:
host name: n0001
host name: n0001
PS:
Read file <test1.err> for stderr output of this job.

One final test – running a simple MPI program – computes the value of pi (see the job script in Listing 11). It was built using the Open64 compiler and mpich2 (a previous article [8] explains the environment I used). Listing 12 shows how I submitted the job with a couple of queries on its status.

Listing 11: Openlava Running MPI Code

#!/bin/bash
#
# Test 2 script for openlava
#
#BSUB -P test2                          # Project test2
#BSUB -n 2
#BSUB -o test2.out                      # output filename
#BSUB -e test2.err                      # error filename
#BSUB -J test2                          # job name
# Change to correct directory (full path)
#  Not strictly necessary but a good practice
cd /home/laytonjb/TEST_OPENLAVA
# Load needed modules here
. /etc/profile.d/modules.sh
module load compilers/open64/5.0
module load mpi/mpich2/1.5b1-open64-5.0
# Write hosts to a file
for h in `echo $LSB_HOSTS`
do
   echo $h >> pgfile
   echo "host name: $h"
done
# Calculate the number of processors allocated to this run.
NPROCS=`wc -l < ./pgfile`
# Calculate the number of nodes allocated.
NNODES=`uniq ./pgfile | wc -l`
### Display the job context
echo "Running on host `hostname` "
echo "Start Time is `date` "
echo "Directory is `pwd` "
echo "Using ${NPROCS} processors across ${NNODES} nodes "
# Execute mpi command
mpirun -np 2 -machinefile pgfile ./mpi_pi < file1 > output.mpi_pi
# erase file with node names
rm ./pgfile
echo "End time is `date` "

Listing 12: Submit a Script and Check Its Status

[laytonjb@test1 TEST_OPENLAVA]$ bsub -R "type==any" < test2.script
Job <839> is submitted to default queue <normal>.
[laytonjb@test1 TEST_OPENLAVA]$ bjobs
JOBID   USER    STAT  QUEUE    FROM_HOST  EXEC_HOST   JOB_NAME   SUBMIT_TIME
839     laytonj PEND  normal   test1                  test2      Sep 24 20:24
...
[laytonjb@test1 TEST_OPENLAVA]$ bjobs
JOBID   USER    STAT  QUEUE    FROM_HOST  EXEC_HOST   JOB_NAME   SUBMIT_TIME
839     laytonj RUN   normal   test1      n0001       test2      Sep 24 20:24
                                          n0001
...
[laytonjb@test1 TEST_OPENLAVA]$ bjobs
No unfinished job found

In the first line of Listing 12, I submitted the job. After that, I checked its status with the openlava bjobs command. Note that the first status check shows the job as PEND (under the STAT column). After the job starts running, you can see which hosts are executing the job (under the EXEC_HOST column). Finally, no more unfinished jobs are found, so I know the job has finished.
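
A few other standard openlava commands are handy at this stage: bqueues lists the configured queues, bhist shows the history of a job after it has finished, and bkill removes a pending or running job. For example, with the job ID from Listing 12:

[laytonjb@test1 TEST_OPENLAVA]$ bqueues
[laytonjb@test1 TEST_OPENLAVA]$ bhist -l 839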

Summary

A resource manager allows systems and their resources to be shared efficiently. The resource manager discussed in this article, openlava, is based on Platform Lava, which in turn is an open source version of an older version of Platform LSF (now called IBM Platform LSF). So, openlava has a very distinguished pedigree.

Installing openlava was very easy, even using a shared filesystem. This isn't always the case with resource managers, so it is a big benefit of openlava. I recommend reading the installation instructions carefully and heeding every word. (I didn't the first time around.) In particular, pay attention to the following:

- Install openlava into a filesystem shared by all of the nodes if you can (here, /opt is NFS-exported from the master node), and make sure the init scripts and their runlevel symlinks end up on every compute node.
- Create the openlava user (and group) on the master node and on every compute node.
- Make sure /etc/profile.d/openlava.sh is sourced by every user, including root.
- List every node, including the master node, in lsf.cluster.openlava and lsb.hosts, even if the master node will not run jobs.

If you pay attention to these key issues, the installation and activation should go smoothly.

The tips I have included to make sure everything is working properly are the result of some simple mistakes I made. I hope they will help you as well. I also included a few configuration details I hope will help in your initial configuration. In particular, I showed how you can configure the master openlava node so it does not run jobs.

Finally, I wrote a couple of simple job scripts to show very briefly how you can write your own, including a simple MPI job. If you need more details on how to write scripts, you will find a plethora of information on the web, and the openlava mailing list is a wonderful place to get help.

When you think about resource managers – and you will if you run HPC systems – openlava is a very serious contender. Take a look at it and give it a try; it's really easy to install, configure, and manage.