Lead image: © Maxim Borovkov, 123RF.com

High-availability workshop: GFS with DRBD and Pacemaker

Simultaneous

Cluster filesystems such as GFS2 and OCFS2 allow many clients simultaneous access to a storage device. Along with DRBD and Pacemaker, this offers a low-budget option for creating a redundant service – but you need to watch out for a couple of pitfalls. By Martin Loschwitz

Cluster filesystems are most frequently seen in the context of high-availability (HA) clusters. The most popular filesystems of this type are GFS2 and OCFS2 – although Lustre has attracted much attention – and NFS version 4 offers a similar service (pNFS). Of course, you can argue the pros and cons of cluster filesystems until the cows come home (see the "Risks of DRBD in Dual-Primary Mode" boxout), but once the decision is made for a cluster filesystem, Pacemaker and DRBD will help you ensure high availability for the system.

Risks of DRBD in Dual-Primary Mode

Running a DRBD resource in dual-primary mode is a mandatory requirement for using GFS2 with the resource. It is advisable not to underestimate the risks that emanate from this kind of DRBD use.

Standard Filesystems

Filesystems exist for a good reason: They make storage space manageable. Without a filesystem, it would be difficult to find data that you have written to the storage medium when you need it again. The filesystem structure makes it possible to retrieve and modify content on the storage medium. As long as nothing gets in the filesystem's way, this principle works well. And, as long as a storage medium is consistent, everything is well in the admin's universe.

Filesystems always assume that they are the only entity allowed to access a storage medium – or, to be more precise, that only the filesystem instance that belongs to a specific mount is allowed to do so. Simultaneous access by two filesystems to the same storage medium is not included in the concept of the standard filesystem on Linux. The filesystems in Linux thus go to great lengths to prevent simultaneous access by two or more instances of a filesystem to the same medium. You can't remount a filesystem at a different position if it is already mounted.

As long as the medium is only available to a single system, the Linux kernel has no trouble upholding this principle because it has sole control over the available hardware. That is no longer the case with DRBD: The DRBD resource exists on two cluster nodes, and the ext3 driver on node A has no way of querying the state of the same DRBD resource on node B before it writes to the medium.

In the worst case, a write to the DRBD resource would occur on both sides of the cluster, with the two nodes initially not knowing about their counterparts on the other side. This situation is known as a concurrent write. If two instances write to a medium simultaneously in an uncoordinated way, damage to the filesystem is highly likely. Admins then need to restore their backups, and service downtime is just an unpleasant side effect.

Ideas?

DRBD protects itself against concurrent writes by always running in primary-secondary mode in the default configuration. On nodes where a resource is running in secondary mode, the DRBD driver makes sure that no user access to the resource occurs. This setup completely removes the risk of corrupting the filesystem. The downside: Cluster filesystems like GFS or OCFS2 rely on the principle of having unrestricted access to all storage media within a storage network. They have their own mechanisms for avoiding concurrent writes and thus do not rely on DRBD protection. For GFS or OCFS2 to work on DRBD resources, DRBD must be configured to support write access to both nodes for the cluster filesystem. This article describes the DRBD configuration in detail.

The Bad News

Do not assume that running DRBD resources in dual-primary mode is totally unproblematic. On one hand, you need to look at the details of the cluster configuration. In a cluster that runs DRBD resources in dual-primary mode, a working STONITH setup is a must-have; otherwise, no node has the ability to exclude another node from the cluster if the second node is causing problems. On the other hand, fencing can cause unpleasant side effects in the context of cluster filesystems. For example, if one node decides that it needs to use STONITH to kill another node, the cluster filesystem will block all of the I/O traffic – on all nodes belonging to the storage pool – until the purportedly rogue node has been safely ousted. This can quickly turn into a performance problem.

Think Twice

If you think about the drawbacks involved with dual-primary mode, it is hard to understand why so many setups rely on this kind of configuration. The web contains many clustering guides that place assorted configuration files on DRBD with a cluster filesystem just to keep the files available on both cluster nodes. Considering the enormous complexity that this adds to the setup, solutions based on csync2, Rsync, or Puppet/Chef are definitely preferable for that purpose. A setup with a cluster filesystem should only be considered if there are good reasons for it. You could also think about an OCFS2 setup for Oracle.

Incidentally, some people think they can fool ext3 and company by using dual-primary mode for DRBD and restricting access to the filesystem from both sides of the cluster to read-only mode. This is definitely not a good idea: Filesystems in Linux always assume that they – and only they – have complete control over the storage medium. Data cached from a read-only filesystem is never refreshed – after all, the filesystem driver must assume that the data can't have changed in the meantime. Even though this scenario will not corrupt the filesystem on the two nodes, the clients will soon find themselves retrieving obsolete data on read access.

In this article, I will look at deploying Pacemaker as a cluster manager for clustering filesystems using GFS as an example. Unfortunately, the four "major league" enterprise distributions – Debian, Ubuntu, SLES, and RHEL – completely fail to agree on the right kind of setup for this scenario. The gap is particularly obvious when you compare Red Hat's approach with that of the other three distributions.

GFS History

The clustering filesystem GFS has actually been around since 1995, but it didn't start to make a name for itself until Red Hat acquired the vendor Sistina in 2003 and started to push development. Its successor, GFS2, officially made its way into Linux when Linus Torvalds added it to kernel version 2.6.19. GFS2 was thus a later entry to the kernel than the comparable OCFS2, which made it into kernel 2.6.16.

In the kernel, GFS relies on the Distributed Lock Manager (DLM) structure. This software, which was also contributed by Red Hat developers, is basically a large framework that coordinates simultaneous access to storage resources. GFS2 only works if the Distributed Lock Manager is enabled and working properly. Red Hat's desire to see other software products use DLM was fulfilled: cLVM, the LVM cluster variant, also uses DLM, which is a logical choice because Red Hat is massively involved in LVM's development.

If you want to use DLM and GFS, you first need to make sure the kernel modules are loaded. Both of these components also need a userspace counterpart to handle communication with the kernel modules and provide an interface for other programs. For example, a control daemon for GFS ensures that the necessary exchange of data between the nodes in a GFS cluster really does take place.

And, this is precisely the issue: To save the effort of implementing a separate cluster manager for GFS (like the one that, say, Oracle created for OCFS2), the control daemons have always been tightly integrated with Red Hat's own Cman cluster manager in GFS's case. But, the events of the past three years in terms of integrating GFS with available cluster managers will seem more like a bad joke for anybody who has not followed the developments.

Tunnel Vision

While Corosync – mainly propagated by Red Hat – established itself as the future solution for cluster communication, Pacemaker's star continued to rise. Red Hat then created control daemons for DLM and GFS to support the integration of the software with Pacemaker. The binaries responsible for this are dlm_controld.pcmk and gfs_controld.pcmk; like other resources, they are launched via Pacemaker resource agents. Because both DLM and GFS have become part of the Red Hat Cluster Suite (RHCS), the control daemons for Pacemaker were delivered as part of the RHCS scope. Thus, up to RHCS 3.0, everything was fine if you wanted Pacemaker and GFS to cooperate.

But Red Hat changed its policy in Red Hat Cluster Suite 3.1, deciding to massively re-engineer its own cluster product so that Pacemaker – whose main developer had been taken on by Red Hat in the meantime – was given an interface for Cman. The idea was for the CRM to use this interface to communicate with the tools that control DLM and GFS. From Red Hat's point of view, the Pacemaker control daemons for DLM and GFS thus became superfluous, and they were simply removed.

For understandable reasons, Novell wanted to avoid making RHCS part of its own product, although that would be perfectly fine from a licensing point of view. Instead, SUSE will maintain the legacy RHCS 3.0 control daemons autonomously in future. This means that there are currently two methods of running GFS on Linux with Pacemaker – and they differ vastly. On RHEL 6, you need Pacemaker with Cman integration, and on Debian, Ubuntu, and SLES, you need the legacy variant that uses separate control daemons for DLM and GFS in Pacemaker.

Incidentally, the next major change that affects GFS is just around the corner: Red Hat's roadmap envisions replacing Cman completely with Pacemaker. In the not-too-distant future, you can expect Red Hat to make the control daemons for Cman Pacemaker-compatible, which will take them exactly where they were in version 3.0 of RHCS.

That's the theory; now, I'll move on to some action, starting with a GFS cluster setup for Debian, Ubuntu, and SLES. Debian and Ubuntu include Pacemaker in their standard distributions but not exactly the latest version. Pacemaker update packages are available in the Ubuntu HA PPA [1] or in the backports directory for Squeeze [2]. If you use SLES on your servers, you need the High Availability Extension (HAE), which includes packages for GFS. Alternatively, there are third-party packages on the web.

Preparation

As I mentioned earlier, you need the "legacy" daemons for Ubuntu 10.04 and Debian Squeeze – both distributions include them in the gfs-pcmk and dlm-pcmk packages, which should be available on the system. In SLES, you can search for the required packages. This article assumes that you have installed DRBD and that at least one DRBD resource is available.

You need to set up the resource so that it can run in dual-primary mode (Figure 1).

Figure 1: To switch DRBD to dual-primary mode, you need to configure the net section for the resource and add fencing at resource level.

Assuming a "normal" resource, as described in the DRBD article in the HA series [3], this will mean two changes to the resource's configuration.

On one hand, you need to add an allow-two-primaries yes; entry to the net section of the resource's configuration. On the other hand, you need to tell DRBD what to do if a split-brain situation occurs by adding the following lines

after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-2pri disconnect;

to the resource's net section.
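Put together, the additions might look like the following sketch. It assumes DRBD 8.4-style syntax and a resource named disk0 (the name that appears in the device path used later in this article). The fencing policy in the disk section and the handler paths are assumptions about what "fencing at resource level" looks like in practice; check them against the scripts your DRBD packages actually install. The small helper merely checks that a configuration contains the four net settings.

```shell
#!/bin/sh
# Sketch only: the dual-primary additions in context, plus a preflight check.
# Assumptions: DRBD 8.4 syntax, resource name "disk0", handler script paths.

sample_resource() {
  cat <<'EOF'
resource disk0 {
  net {
    allow-two-primaries yes;
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
  }
  disk {
    fencing resource-and-stonith;
  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
  # the existing "on <hostname> { ... }" sections stay unchanged
}
EOF
}

# Return 0 if the configuration read from stdin contains all four net
# settings discussed in the text, non-zero otherwise.
has_dual_primary_settings() {
  conf=$(cat)
  for key in allow-two-primaries after-sb-0pri after-sb-1pri after-sb-2pri; do
    printf '%s\n' "$conf" | grep -q "^[[:space:]]*$key" || return 1
  done
  return 0
}

sample_resource | has_dual_primary_settings && echo "dual-primary settings present"
```

On a cluster node, the same check could be pointed at the real resource file instead of the sample, for example with has_dual_primary_settings < /etc/drbd.d/disk0.res (the path is again a placeholder).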

Additionally, Pacemaker must be installed with its basic configuration complete. At this point, I should mention that Pacemaker relies on a working STONITH configuration; this is the only option the cluster manager has for excluding cluster nodes that have run haywire. The final assumption I make in this article is that the DRBD resource is already configured as such in Pacemaker, and that it is available as ms_drbd_gfs. Note that you need to create the ms resource in a slightly different way than when a DRBD is running in primary-secondary mode:

ms ms_drbd_gfs p_drbd_gfs meta master-max=2 clone-max=2 notify=true

After fulfilling these requirements, you need to switch the DRBD resource on both sides of the cluster to Primary mode by issuing the

drbdadm primary resource

command (Figure 2).

Figure 2: After configuring DRBD correctly, the dual-primary will work, as you can see here.
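If you prefer a scripted check to reading the status output, you could evaluate what drbdadm role reports. The sketch below assumes DRBD 8.x, where that command prints the roles as local/peer; disk0 is a placeholder resource name.

```shell
#!/bin/sh
# Sketch: interpret "drbdadm role" output (DRBD 8.x prints local/peer).

# Return 0 only if both the local node and the peer are Primary.
is_dual_primary() {
  [ "$1" = "Primary/Primary" ]
}

# On a cluster node you would feed it live data, for example:
#   is_dual_primary "$(drbdadm role disk0)" && echo "both nodes are Primary"
for role in "Primary/Primary" "Primary/Secondary"; do
  if is_dual_primary "$role"; then
    echo "$role: dual-primary"
  else
    echo "$role: not dual-primary"
  fi
done
```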

GFS2 in the Cluster

The next step is to configure the resource in Pacemaker. The cluster needs two services for GFS: dlm_controld.pcmk, which handles communication with the Distributed Lock Manager, and gfs_controld.pcmk, which controls the GFS2 daemon. The two services are managed via the ocf:pacemaker:controld resource agent. The configuration in the CRM shell for this example should look similar to Listing 1.

Listing 1: Resource Configuration

primitive p_controld_dlm ocf:pacemaker:controld op monitor interval="120s"
primitive p_controld_gfs ocf:pacemaker:controld params daemon="gfs_controld.pcmk" args="" op monitor interval="120s"
clone cl_controld_dlm p_controld_dlm meta globally-unique="false" interleave="true"
clone cl_controld_gfs p_controld_gfs meta globally-unique="false" interleave="true"
colocation co_dlm_always_with_drbd_master inf: cl_controld_dlm ms_drbd_gfs:Master
colocation co_gfs_always_with_dlm inf: cl_controld_gfs cl_controld_dlm
order o_gfs_always_after_dlm inf: cl_controld_dlm cl_controld_gfs
order o_dlm_always_after_drbd inf: ms_drbd_gfs:promote cl_controld_dlm

The entries here tell Pacemaker that the control daemons should be running on all cluster nodes – the clone entries take care of this. The constraints ensure that the DLM daemon only starts where the DRBD resource is running in primary mode, and only after it has been promoted, and that the GFS control daemon only starts on nodes where DLM is already running.

Once DLM and the GFS control daemons are running, the GFS filesystem can be created on the DRBD resource:

sudo mkfs.gfs2 -p lock_dlm -j2 -t pcmk:pcmk resource

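Here, resource stands for the device node of the DRBD resource – for example, /dev/drbd/by-res/disk0/0. The -j2 option creates two journals, one for each node that mounts the filesystem, and -t sets the lock table name in the form clustername:fsname.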

What's missing now is the filesystem resource in Pacemaker, which mounts GFS on the cluster nodes. The ocf:heartbeat:Filesystem resource agent supports GFS2, and the matching resource configuration is shown in Listing 2.

Listing 2: GFS2 Filesystem Resource

primitive p_filesystem ocf:heartbeat:Filesystem \
    params device="/dev/drbd/by-res/disk0/0" directory="/opt" fstype="gfs2" \
    op monitor interval="120s" \
    meta target-role="Started"
clone cl_filesystem p_filesystem meta interleave="true" ordered="true"
colocation co_filesystem_always_with_gfs inf: cl_filesystem cl_controld_gfs
order o_filesystem_always_after_gfs inf: cl_controld_gfs cl_filesystem

In combination with the clone entry, the primitive resource ensures that the filesystem is mounted on all the cluster nodes; the constraints, however, stipulate that it may only be mounted on nodes where the GFS control daemon is running. After issuing a commit in the CRM shell, the cluster should look like Figure 3.

Figure 3: crm_mon -1 shows that all GFS services are correctly configured and working.

GFS2 on RHEL 6

The GFS setup on Red Hat Enterprise Linux 6 (and distributions compatible with it) is different. The control daemons that connect Pacemaker with DLM and GFS do not exist there. Instead, Pacemaker has a direct interface to Cman (Figure 4). On these systems, you do not launch Corosync and have it load Pacemaker as a module, as has been the case in this HA series thus far. Instead, Pacemaker runs as a child process of Cman and receives critical information from it.

Figure 4: The ocf:pacemaker:controld resource agent only exists on Ubuntu, Debian, and SLES. Cman is responsible for controlling DLM and GFS on RHEL-compatible systems.

Configuring the DRBD resource on RHEL is no different from the configuration for Debian, Ubuntu, or SLES. RHEL also needs the DRBD resource to be set up in dual-primary mode. Once you have a DRBD configuration that fits the bill, you can turn to configuring Cman, which you will typically need to install first.

Cman for Pacemaker

Cman's configuration is located in /etc/cluster/cluster.conf. The file uses an XML-based syntax – as does Pacemaker internally. You typically only need to edit it once. For use with Pacemaker, cluster.conf might look like Listing 3.

Listing 3: cluster.conf

<?xml version="1.0"?>
<cluster config_version="1" name="mycluster">
  <logging debug="off"/>
  <clusternodes>
    <clusternode name="pcmk-1" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="pcmk-1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="pcmk-2" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="pcmk-2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
</cluster>

This example also includes some fencing directives that force Cman to forward fencing requests of any kind directly to Pacemaker. The clusternode entries refer to the cluster nodes; instead of pcmk-1 and pcmk-2, you will need the node names for your environment here, both in the name attribute and in the matching port attribute.
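Because a mistyped node name is an easy mistake to make, a quick check before starting Cman can help. The sketch below extracts the clusternode names with sed and tests whether a given host name is among them. The embedded XML is a trimmed copy of Listing 3, and the assumption that node names equal the output of uname -n is just that, an assumption about your environment.

```shell
#!/bin/sh
# Sketch: list the node names in a cluster.conf and look up one host name.
# On a real node, read /etc/cluster/cluster.conf instead of the sample.

sample_conf() {
  cat <<'EOF'
<?xml version="1.0"?>
<cluster config_version="1" name="mycluster">
  <clusternodes>
    <clusternode name="pcmk-1" nodeid="1"/>
    <clusternode name="pcmk-2" nodeid="2"/>
  </clusternodes>
</cluster>
EOF
}

# Print one node name per line (reads cluster.conf from stdin).
cluster_nodes() {
  sed -n 's/.*<clusternode name="\([^"]*\)".*/\1/p'
}

# Return 0 if host $1 is listed as a clusternode in the config on stdin.
is_cluster_node() {
  cluster_nodes | grep -qxF "$1"
}

sample_conf | cluster_nodes
# Live check (assumes node names match "uname -n"):
#   is_cluster_node "$(uname -n)" < /etc/cluster/cluster.conf
```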

Once the Cman configuration file is set up, it's time to start the matching service: service cman start. The output at the command line should look like Figure 5. You can clearly see that Cman launches the control daemons for both DLM and GFS. In contrast to the variant for Debian and others, these services are not part of the cluster information base (CIB) monitored by Pacemaker on RHEL-compatible systems.

Figure 5: The service cman start command launches Cman, which then launches Pacemaker as the cluster manager proper.

DRBD for the Cman/Pacemaker Team

When you first launch Pacemaker in this way, the CIB will obviously be empty. It is thus a good idea to take care of fencing and the basic Pacemaker settings now. The resource and the master/slave setup for DRBD can now be added to the CIB; you can use the same approach as for the other systems, as described previously.

Additionally, you need to ensure that Pacemaker launches DRBD in dual-primary mode on RHEL as well. You can set the value of master-max and clone-max to 2 for this. Again, the example refers to the resource as ms_drbd_gfs.

Creating a GFS2 Filesystem

The next step is to create a GFS filesystem on the DRBD resource, which is now running in primary mode on both sides of the cluster, thanks to Pacemaker. The command for this is

mkfs.gfs2 -p lock_dlm -j 2 -t pcmk:web device

where you need to replace device with the resource's device node: for example, /dev/drbd/by-res/drbd0/0.
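One detail deserves a second look here. According to the mkfs.gfs2 man page, the lock table name passed to -t has the form clustername:fsname, and the cluster name must match the one in cluster.conf. Listing 3 names the cluster mycluster, whereas the command above uses pcmk, so one of the two probably needs adjusting. The following sketch derives the argument from the configuration so the two cannot drift apart; it only prints the command rather than running it, and the filesystem name web and the device path are simply the values used in this article.

```shell
#!/bin/sh
# Sketch: build the mkfs.gfs2 call from the cluster name in cluster.conf.
# Prints the command instead of executing it.

# Print the name attribute of the <cluster> element (config on stdin).
cluster_name() {
  sed -n 's/.*<cluster [^>]*name="\([^"]*\)".*/\1/p' | head -n 1
}

# Usage: gfs2_mkfs_cmd <cluster-name> <fs-name> <journals> <device>
# One journal is needed per node that will mount the filesystem.
gfs2_mkfs_cmd() {
  printf 'mkfs.gfs2 -p lock_dlm -j %s -t %s:%s %s\n' "$3" "$1" "$2" "$4"
}

# Sample input: the <cluster> line from Listing 3. On a real node, use
#   name=$(cluster_name < /etc/cluster/cluster.conf)
name=$(printf '%s\n' '<cluster config_version="1" name="mycluster">' | cluster_name)
gfs2_mkfs_cmd "$name" web 2 /dev/drbd/by-res/drbd0/0
```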

Finally, Pacemaker still needs a filesystem resource that makes GFS usable on the cluster nodes, including clone rules and constraints (Listing 4).

Listing 4: Pacemaker Filesystem Resource

primitive p_gfs_fs ocf:heartbeat:Filesystem \
    params device="/dev/drbd/by-res/drbd0/0" directory="/opt" fstype="gfs2"
clone cl_gfs_fs p_gfs_fs
colocation co_cl_gfs_fs_always_with_ms_drbd_gfs_master inf: cl_gfs_fs ms_drbd_gfs:Master
order o_cl_gfs_fs_always_after_ms_drbd_gfs_promote inf: ms_drbd_gfs:promote cl_gfs_fs

Any other services that you want Pacemaker to launch on both sides of the cluster, once GFS is available on both nodes, would thus be integrated using colocation and order constraints with cl_gfs_fs.
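As an illustration of that pattern, the sketch below prints the two constraints for one additional cloned service, following the naming scheme of the listings. The clone cl_apache is purely hypothetical; the output is meant to be pasted into the CRM shell's configure mode once such a clone exists.

```shell
#!/bin/sh
# Sketch: generate the colocation and order constraints that tie another
# cloned service to the GFS2 mount (cl_gfs_fs from Listing 4).
# "cl_apache" is a hypothetical clone used only for illustration.

# Usage: constraints_after_gfs <clone-id>
constraints_after_gfs() {
  svc=$1
  # Run the service only where the filesystem is mounted ...
  printf 'colocation co_%s_always_with_cl_gfs_fs inf: %s cl_gfs_fs\n' "$svc" "$svc"
  # ... and start it only after the mount has succeeded.
  printf 'order o_%s_always_after_cl_gfs_fs inf: cl_gfs_fs %s\n' "$svc" "$svc"
}

constraints_after_gfs cl_apache
```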

Conclusions

Finding a meaningful deployment scenario for this GFS cluster constellation is probably far more difficult than setting up the whole thing. The requirement for a comprehensive fencing setup, which follows from running DRBD in dual-primary mode, makes this solution complex.

A replicated filesystem solution might soon send GFS and company to the back of the field, and Gluster and Ceph are already champing at the bit.