The RADOS object store and Ceph filesystem
Object of Desire
Cloud computing is, without a doubt, the IT topic of our time. No issue pushes infrastructure managers at large enterprises as hard as the question of how best to implement the cloud. The IaaS principle is quite simple: The goal is to provide capacity to users in the form of computational power and storage in a way that means as little work as possible for the user and that keeps the entire platform as flexible as possible.
Put more tangibly, this means customers can request CPU cycles and disk space as they see fit and continue to use both as the corresponding services are needed. For IT service providers, this means defining your own setup to be as scalable as possible. It should be possible to accommodate peak loads without difficulty, and if the platform grows – which will be the objective of practically any enterprise – a permanent extension should be accomplished easily.
In practical terms, implementing this kind of solution tends to be more complex. Scalable virtualization environments are easy to achieve: Xen and KVM, in combination with the current crop of management tools, make it easy to manage virtualization hosts. Scale out also is no longer an issue: If the platform needs more computational performance, you can add more machines that integrate seamlessly with the existing infrastructure.
Things start to become more interesting when you look at storage. The way IT environments store data has remained virtually unchanged in the past few years.
In the early 1990s, data centers comprised many servers with local storage, all of which suffered from legacy single points of failure. As of the mid-1990s, Fibre Channel HBAs and matching SAN storage entered the scene, offering far more redundancy than their predecessors, but at a far higher price. People who preferred a lower budget approach turned to DRBD with standard hardware a few years ago, thus avoiding what can be hugely expensive SANs. However, all of these approaches share a problem: They do not scale out seamlessly.
Scale Out and Scale Up
Admins and IT managers distinguish between two basic types of scalability. Vertical scalability (scale up) is based on the idea of extending the resources of existing devices, whereas horizontal scalability (scale out) relies on adding more resources to the existing set (Figure 1). Databases are a classic example of a scale-out solution: Typically, slave servers are added to support load distribution.
Scale out is completely new ground when it comes to storage. Local storage in servers, SAN storage, or servers with DRBD will typically only scale vertically (more disks!), not horizontally. When the enclosure is full, you need a second storage device, and this will typically not integrate with the existing storage to form a single unit, thus making maintenance far more difficult. In terms of SAN storage, two SANs just cost twice as much as one.
Object Stores
If you are planning a cloud and thinking about seamlessly scalable storage, don't become despondent at this point: Authors of the popular cloud applications are fully aware of this problem and offer workable solutions known as object stores.
Object stores follow a simple principle: All servers that become part of an object store run software that manages and exports the server's local disk space. All instances of this software collaborate on the cluster, thus providing the illusion of a single, large data store. To support internal storage management, the object store software does not save data in its original format on the individual storage nodes, but as binary objects. The number of individual nodes joining forces to create the large object store is arbitrary, and you can even add storage nodes on the fly.
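As a toy illustration of this flat, binary-object layout (this is a sketch, not the actual RADOS on-disk format), imagine each stored file becoming a blob under a UUID-style name inside an OSD's data directory:

```shell
# Toy sketch only -- not the real RADOS object format.
store="$(mktemp -d)"                              # stands in for one OSD's data directory
printf 'hello object store\n' > "$store.input"    # a file a user wants to store
obj="$store/$(cat /proc/sys/kernel/random/uuid)"  # flat, UUID-style object name
cp "$store.input" "$obj"
ls "$store"                                       # one object, no subfolders
```

The point is the flat namespace: the store knows objects only by opaque names, and any hierarchy the user sees is reconstructed above this layer.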
Because the object storage software also has internal mechanisms to handle redundancy, and the whole thing works with standard hardware, a solution of this kind combines the benefits of SANs or DRBD storage and seamless horizontal scalability. RADOS has set out to be the king of the hill in this sector, in combination with the matching Ceph filesystem.
How RADOS Works
RADOS (reliable autonomic distributed object store, although many people mistakenly say "autonomous") has been under development at DreamHost, led by Sage A. Weil, for a number of years and is basically the result of a doctoral thesis at the University of California, Santa Cruz. RADOS implements precisely the functionality of an object store as described earlier, distinguishing between three different layers to do so:
- Object Storage Devices (OSDs): An OSD in RADOS is always a folder within an existing filesystem. All OSDs together form the object store proper, and the binary objects that RADOS generates from the files to be stored reside in the store. The hierarchy within the OSDs is flat: files with UUID-style names but no subfolders.
- Monitoring servers (MONs): MONs form the interface to the RADOS store and support access to the objects within the store. They handle communication with all external applications and work in a decentralized way. There are no restrictions in terms of numbers, and any client can talk to any MON. MONs manage the MONmap (a list of all MONs) and the OSDmap (a list of all OSDs). The information from these two lists lets clients compute which OSD they need to contact to access a specific file. In the style of a Paxos cluster, MONs also ensure RADOS's functionality in terms of respecting quorum rules.
- Metadata servers (MDS): MDSs provide Posix metadata for objects in the RADOS object store for Ceph clients.
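The map-based lookup can be pictured with a toy calculation (this is not the real placement algorithm, CRUSH, which is far more sophisticated): given the list of OSDs from the OSDmap, a client hashes an object's name and derives the responsible OSD itself, with no central lookup server involved:

```shell
# Toy placement function -- deterministic, computed client-side.
# (RADOS's real algorithm, CRUSH, is far more sophisticated.)
osds="alice bob charlie"                              # known from the OSDmap
n="$(echo "$osds" | wc -w)"
name="myfile"
hash="$(printf '%s' "$name" | cksum | cut -d' ' -f1)" # any stable hash will do
idx=$(( hash % n ))
target="$(echo "$osds" | cut -d' ' -f$((idx + 1)))"
echo "object '$name' -> OSD on $target"
```

Because every client computes the same answer from the same maps, MONs only need to keep the maps consistent rather than broker every data access.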
What About Ceph?
Most articles about RADOS just refer to Ceph in the title, causing some confusion. Weil described the relationship between RADOS and Ceph as two parts of the same solution: RADOS is the "lower" part and Ceph the "upper" part. One thing is for sure: The best-looking object store in the world is useless if it doesn't give you any options for accessing the data you store in it.
However, Ceph offers precisely these options for RADOS: Ceph is a filesystem that accesses the object store in the background and thus makes its data directly usable in the application. The metadata servers help accomplish this task by providing the metadata required for each file that Ceph accesses in line with the Posix standard when a user requests a file via Ceph.
Because DreamHost didn't consider until a later stage of development that RADOS could be used as a back end for tools other than filesystems, they generated confusion regarding the names of RADOS and Ceph. For example, the official DreamHost guides refer simply to "Ceph" when they actually mean "RADOS and Ceph."
First RADOS Setup
Theory is one thing, but to gain some understanding of RADOS and Ceph, it makes much more sense to experiment on a "live object." You don't need much for a complete RADOS-Ceph setup: Three servers with local storage will do fine. Why three? Remember that RADOS autonomically provides a high-availability option. The MONs use the Paxos implementation referred to earlier, which requires that more than half of them agree on the state of the cluster. Although you could turn a single node into a RADOS store, this wouldn't give you much in the line of high availability. A RADOS cluster comprising two nodes is even more critical: In the normal case, the cluster has a quorum, but if one of the two nodes fails, the surviving node is useless because RADOS needs a quorum, and one node out of two can't form a majority by definition. In other words, you need at least three nodes to be on the safe side, so the failure of a single node won't be an issue.
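The quorum arithmetic behind this recommendation is simple: a majority of floor(n/2)+1 MONs must be reachable. A quick loop shows why two nodes buy you nothing over one:

```shell
# Majority quorum: floor(n/2) + 1 monitors must be up for the cluster to work.
for n in 1 2 3 4 5; do
  q=$(( n / 2 + 1 ))
  echo "$n MON(s): quorum = $q, tolerates $(( n - q )) failure(s)"
done
```

With two MONs, the quorum is two, so the loss of either node stops the cluster; three MONs is the smallest configuration that survives a single failure.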
Incidentally, nothing can stop you from using virtual machines with RADOS for your experiments – RADOS doesn't have any functions that require specific hardware features.
Finding the Software
Before experimenting, you need to install RADOS and Ceph. Ceph is a plain vanilla filesystem driver on Linux (much like ext3 or ext4); it made its way into the kernel in Linux 2.6.34 and is thus available for any distribution with this or a later kernel version (Figure 2). The situation isn't quite as easy with RADOS; however, the documentation [1] points to prebuilt packages, or at least gives you an installation guide, for all of the popular distributions. Note that although the documentation refers to "ceph," the packages contain all of the components you need for RADOS. After installing the packages, it's time to prepare RADOS.
Preparing the OSDs
RADOS needs OSDs. As I mentioned earlier, any folder on a filesystem can act as an OSD; however, the filesystem must support extended attributes. The RADOS authors recommend Btrfs but also mention XFS as an alternative for anyone who is still a bit wary of using Btrfs. For simplicity's sake, I will assume in the following examples that you have a directory named osd.ID in /srv on three servers, where ID stands for the server's hostname in each case. If your three servers are named alice, bob, and charlie, you would have a folder named /srv/osd.alice on server alice, and so on. If you will be using a local filesystem set up specially for this purpose, be sure to mount it in /srv/osd.ID. Finally, each of the hosts also needs a mon.ID folder in /srv, where ID again stands for the hostname.
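The directory scheme can be scripted so the same snippet runs unchanged on each node; this is a sketch that derives ID from the hostname (the SRV variable is an assumption added here so you can dry-run it outside /srv):

```shell
# Sketch: create the OSD and MON directories on one node; run the same
# commands on every host. SRV defaults to /srv as in the article's scheme.
SRV="${SRV:-/srv}"
ID="$(hostname -s)"
mkdir -p "$SRV/osd.$ID" "$SRV/mon.$ID"
ls -d "$SRV"/osd.* "$SRV"/mon.*
```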
In this sample setup, the central RADOS configuration in /etc/ceph/ceph.conf might look like Listing 1. The configuration file defines the following details: Each of the three hosts provides an OSD and a MON server; host alice also runs an MDS to ensure that any Ceph clients will find Posix-compatible metadata on access. Authentication between the nodes is encrypted. The keyring for this is stored in the /etc/ceph folder and goes by the name of $name.keyring, where RADOS automatically replaces $name with the actual value later.
Listing 1: Sample /etc/ceph/ceph.conf
```
[global]
    auth supported = cephx
    keyring = /etc/ceph/$name.keyring
[mon]
    mon data = /srv/mon.$id
[mds]
[osd]
    osd data = /srv/osd.$id
    osd journal = /srv/osd.$id.journal
    osd journal size = 1000
[mon.a]
    host = alice
    mon addr = 10.42.0.101:6789
[mon.b]
    host = bob
    mon addr = 10.42.0.102:6789
[mon.c]
    host = charlie
    mon addr = 10.42.0.103:6789
[osd.0]
    host = alice
[osd.1]
    host = bob
[osd.2]
    host = charlie
[mds.a]
    host = alice
```
Most importantly, the nodes in the RADOS cluster must be able to reach one another directly using the hostnames from your ceph.conf file. This could mean adding these names to your /etc/hosts file.
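If DNS doesn't resolve the node names, a matching /etc/hosts entry set on each node could look like this (addresses taken from Listing 1):

```
10.42.0.101   alice
10.42.0.102   bob
10.42.0.103   charlie
```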
Additionally, you need to be able to log in to all of the RADOS nodes as root later on for the call to mkcephfs, and root needs to be able to call sudo without an additional password prompt on all of the nodes. After fulfilling these conditions, the next step is to create the keyring to support mutual authentication between the RADOS nodes:
mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/admin.keyring
Now you need to ensure that ceph.conf exists on all the hosts belonging to the cluster (Figure 3). If this is the case, you just need to start RADOS on all of the nodes; typing

/etc/init.d/ceph start

will do the trick. After a couple of seconds, the three nodes should have joined the cluster; you can check this by typing

ceph -k /etc/ceph/admin.keyring -c /etc/ceph/ceph.conf health

which should give you the output shown in Figure 4.
Using Ceph to Mount the Filesystem
To mount the newly created filesystem on another host, you can use the normal mount command – the target host is one of the MON servers (i.e., alice in this example, with a MON address of 10.42.0.101:6789 in ceph.conf). Because CephX authentication is being used, I need to identify the login credentials automatically generated by Ceph before I can mount the filesystem. The following command on one of the RADOS nodes outputs the credentials:

ceph-authtool -l /etc/ceph/admin.keyring

The mount process then follows:

mount -t ceph 10.42.0.101:6789:/ /mnt/osd -vv -o name=admin,secret=mykey

where mykey needs to be replaced by the value of key that you determined with the last command. The mountpoint in this example is /mnt/osd, which you can use as a normal directory from now on.
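Copying the secret out of the keyring by hand is error prone; a small awk filter can extract it for scripting. The sketch below runs against a sample keyring written to a temp file (the key value is fabricated for illustration) – on a real node you would point the filter at /etc/ceph/admin.keyring instead:

```shell
# Write a sample keyring resembling what ceph-authtool -l prints
# (the key value here is made up for illustration).
keyring="$(mktemp)"
cat > "$keyring" <<'EOF'
[client.admin]
    key = AQDfT89PqHn/DhAAXmDBsFzUk6zMIqZpW3dPPA==
EOF

# Pull out just the secret; this is the value that goes after secret=
# in the mount call, e.g. -o name=admin,secret="$secret"
secret="$(awk '$1 == "key" {print $3}' "$keyring")"
echo "$secret"
```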
The Crush Map
RADOS and Ceph use a fair amount of magic in the background to safeguard the setup against any kind of failure, starting with the mount. Any of the existing MON servers can act as the mount target; however, this doesn't mean communications are only channeled between the client and this one MON. Instead, Ceph on the client receives the MONmap and OSDmap from the MON server it contacts and then references them to compute which OSD is responsible for a specific file before going on to handle the communication with this OSD directly.
The Crush map is another step for improving redundancy. It defines which hosts belong to a RADOS cluster, how many OSDs exist in the cluster, and how to distribute the files over these hosts for best effect. The Crush map makes RADOS rack-aware, allowing admins to manipulate the internal replication of RADOS in terms of individual servers, racks, or security zones in the data center. The setup shown for the example here also has a rudimentary default Crush map. If you want to experiment with your own Crush map, the Ceph wiki [2] gives you the most important information for getting started.
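On a running cluster, you can inspect the compiled Crush map with ceph osd getcrushmap -o /tmp/cm and decompile it with crushtool -d /tmp/cm -o /tmp/cm.txt. A decompiled map contains bucket and rule definitions along these lines (an illustrative fragment, not the exact default map of this setup):

```
host alice {
    id -2
    alg straw
    hash 0
    item osd.0 weight 1.000
}
rule data {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
```

The "chooseleaf ... type host" step is what makes replicas land on different hosts rather than merely on different OSDs; changing the type to rack is the basis of rack awareness.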
Extending the Existing Setup
How do you go about extending an existing RADOS cluster by adding more nodes to increase the amount of available storage? If you want to add a node named daisy to the existing setup, the first step is to define an ID for it. In this example, OSD IDs 0 through 2 are already assigned to alice, bob, and charlie; daisy is given the ID 4 here, so you need to type:
ceph osd create 4
Then you need to extend the ceph.conf files on the existing cluster nodes, adding an entry for daisy. On daisy, you also need to create the folder structure needed for daisy to act as an OSD (i.e., the directories in /srv, as in the previous examples). Next, copy the new ceph.conf to the /etc/ceph folder on daisy.
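Following the pattern of Listing 1, the entry added to ceph.conf for daisy would look something like this (keeping the OSD ID 4 used in this example):

```
[osd.4]
    host = daisy
```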
Daisy also needs to know the current MON structure – after all, she will need to register with an existing MON later on. This means daisy needs the current MONmap for the RADOS cluster. You can read the MONmap on one of the existing RADOS nodes by typing

ceph mon getmap -o /tmp/monmap

(Figure 5) and then use scp to copy it to daisy (this example assumes you are storing the MONmap in /tmp/monmap on daisy). Now, you need to initialize the OSD directory on daisy:
ceph-osd -c /etc/ceph/ceph.conf -i 4 --mkfs --monmap /tmp/monmap --mkkey
If the additional cluster node uses an ID other than 4, you need to modify the numeric value that follows -i.
Finally, you need to introduce daisy to the existing cluster. In this example, the last command created an /etc/ceph/osd.4.keyring file on daisy, which you can copy to one of the existing MONs with scp. Running the following:
ceph auth add osd.4 osd 'allow *' mon 'allow rwx' -i /etc/ceph/osd.4.keyring
on the same node adds the new OSD to the existing authentication structure (Figure 6). Typing /etc/init.d/ceph start on daisy launches RADOS, and the new OSD registers with the existing RADOS cluster. The final step is to modify the existing Crush map so the new OSD is used. You can type:
ceph osd crush add 4 osd.4 1.0 pool=default host=daisy
to do this. The new OSD is now part of the existing RADOS/Ceph cluster.
Conclusions
It isn't difficult to set up the combination of RADOS and Ceph. But this simple configuration doesn't leverage many of the exciting features that RADOS offers. For example, the Crush map functionality provides you with the option of deploying huge RADOS setups over multiple racks in the data center while offering intelligent failsafes.
Because RADOS also offers the option of dividing its storage into individual pools of variable sizes, you can achieve more granularity in terms of different tasks and target groups.
Additionally, I haven't looked at the RADOS front ends beyond Ceph. After all, Ceph is just one front end of many; in this case, it supports access to files in the object store via the Linux filesystem. However, more options for accessing the data on RADOS exist.
The RADOS block device, or rbd for short, is a good choice when you need to access data in the object store at the block device level. This is the case, for example, for virtual machines, which will typically accept block devices as a back end for virtual disks, thus avoiding slower solutions with disk images. In this way, you can exploit RADOS's potential as an all-encompassing storage system for large virtualization solutions while solving another problem in the context of the cloud.
Speaking of the cloud: besides rbd, librados provides various interfaces for HTTP access – for example, a variant compatible with Amazon's S3 and a Swift-compatible variant. A generic REST interface is also available. As you can see, RADOS includes a good selection of interfaces to the world outside.
At the time of writing, the RADOS and Ceph components had not yet seen a stable release, but the developers were working toward version 1.0, which may already be available as an officially stable, "enterprise-ready" version by the time you read this.