Galera Cluster for MySQL
Cluster Time
Why does the world need yet another MySQL clustering solution when several diverse approaches for increasing the availability of MySQL databases already exist? Because the traditional answers to the problem of clustering MySQL and replicating changes have significant limitations. For instance, the master/slave replication model can lead to delays that cause data loss, and active/passive failover solutions introduce significant complexity that can lead to misconfiguration and other problems. (See the box titled "Alternatives.")
Galera Cluster is designed to avoid the issues associated with other MySQL clustering alternatives. Galera offers an uncomplicated solution for achieving high availability and scaling read requests.
Introducing Galera Cluster
Galera Cluster is a synchronous-replication cluster based on the InnoDB storage engine. Galera uses a true active/active multimaster topology, which means applications can read from and write to all nodes simultaneously. If one node breaks down, the other nodes register this automatically. When a new node joins the cluster, it is automatically integrated and synchronized.
In contrast to master/slave replication solutions, Galera Cluster replicates in parallel at the row level, thus achieving a much faster data flow than MySQL master/slave replication. Write access is scalable to a certain extent; read access scales with the number of cluster nodes. In contrast to MySQL replication, nodes cannot lag behind, so there is no need to check the data for currency, and transactions cannot be lost.
The underlying dependence on the InnoDB storage engine means that all tables must be available as InnoDB tables. If for some reason the tables are not available in InnoDB format, Galera is only of limited usefulness, but usually tables can be converted to InnoDB without any problems.
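As a quick check, a sketch along the following lines lists any tables that are not stored in InnoDB and converts one of them (the schema and table names in mydb.mytable are placeholders, not part of the setup described here):

SELECT table_schema, table_name, engine
  FROM information_schema.tables
 WHERE engine IS NOT NULL
   AND engine <> 'InnoDB'
   AND table_schema NOT IN ('mysql', 'information_schema', 'performance_schema');

ALTER TABLE mydb.mytable ENGINE = InnoDB;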
Much Light – But Also Shadow?
With so many advantages, the question naturally arises: Where's the catch? Galera Cluster does indeed bring some disadvantages. Because MySQL does not offer all the necessary functions itself, the MySQL code must be patched. The replication itself is handled by a plugin. Fortunately, the user can resort to ready-made binaries provided by the company Codership. Alternatively, binaries from Percona or MariaDB can be used.
A further disadvantage of Galera Cluster is that an increased number of conflicts arise in the case of hotspots (small tables with a very high rate of change). In this scenario, the application increasingly receives deadlock messages and must deal with them. One last disadvantage is that, during the initial complete synchronization, when a node is first integrated into the cluster, the node that provides the data (the so-called donor node) is not available for any other access. That is also one of the main reasons Galera Cluster should always be configured with at least three nodes (Figure 1).
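To get a feel for how often such conflicts actually occur, you can watch Galera's conflict counters, for example (the exact variable names may vary slightly between wsrep versions):

SHOW GLOBAL STATUS LIKE 'wsrep_local_cert_failures';
SHOW GLOBAL STATUS LIKE 'wsrep_local_bf_aborts';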
Topology and Function
The recommended setup for Galera Cluster consists of three nodes (servers). These server nodes can be either physical or virtual. (The servers can even be spread out over different data centers, although response times will understandably suffer if the servers are far apart.) To achieve high availability for the whole system, a load balancer is added between the application and the Galera Cluster.
The patches from Codership intervene in the transaction and replication mechanism of MySQL. When a COMMIT command occurs, the write-set (i.e., the transaction plus metadata) from the so-called master node (the node that receives the transaction) is transmitted to all other nodes (the slaves) in the cluster. The cluster then generates a global transaction ID (GTID). With it, each node (including the master) can validate the write-set to determine whether it can be applied or not. If the validation is successful, the write-set is applied. Otherwise, the slave rejects the write-set and the master performs a rollback.
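As an illustration of what such a conflict looks like to the application, suppose two nodes change the same row at the same time; only one of the two commits survives validation. The table name accounts is a placeholder, and the exact error text depends on the MySQL/wsrep version:

-- On node1:
BEGIN;
UPDATE accounts SET balance = balance - 10 WHERE id = 1;
COMMIT;

-- At the same time on node2:
BEGIN;
UPDATE accounts SET balance = balance + 10 WHERE id = 1;
COMMIT;
-- One of the two COMMIT statements fails, typically with:
-- ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction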
When the slave node cannot apply the write-set (e.g., because the disk is full), it is excluded from the cluster and must terminate itself to guarantee that no data inconsistencies arise. Upon re-entry into the cluster, it must synchronize itself with the other nodes with a so-called state snapshot transfer (SST) to achieve a consistent data state. The SST is a complete transmission of the data from a so-called donor node to the receiving node newly joining the cluster. During an SST, the donor is not available for SQL queries. This is one reason why a Galera setup with only two nodes is not recommended.
An SST can be accomplished with mysqldump, rsync, or XtraBackup (2.0 and up). With a very large data volume (as a rule of thumb: data volume > RAM), mysqldump is no longer a suitable SST method because of the long processing time.
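In that case, it makes sense to switch the SST method in my.cnf; a sketch of the corresponding entry, here using rsync (xtrabackup would be configured analogously):

wsrep_sst_method = rsync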
Galera Cluster 2.0 includes the so-called incremental state transfer (IST). IST makes it possible for the cluster to transmit and apply only the missing increments on restart, which naturally is much quicker. This feature is especially interesting when the Galera Cluster is deployed in a WAN, where transmitting the complete data volume would take forever because of the limited bandwidth.
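Whether an IST is possible depends on whether the missing write-sets are still in the donor's write-set cache. As a sketch, the cache size can be raised through the provider options in my.cnf; gcache.size is, to my knowledge, the relevant option, but check the documentation of your Galera version:

wsrep_provider_options = "gcache.size=1G"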
Installation
Galera Cluster runs flawlessly both on physical machines and on virtual systems. The recommended setup consists of three nodes. Accordingly, an administrator who wants to test the cluster can start three Linux nodes on any suitable virtualization solution. Ideally, each node will have two interfaces: one for internal communication within the cluster and one for communication with the applications. The following example uses CentOS 6.2 with the newest updates.
First, the installer should make certain that all the nodes of the future cluster can reach one another. The IP addresses for the following configuration are shown in Table 1.
Table 1: IP Addresses

| Node  | Internal Interface    | External Interface |
|-------|-----------------------|--------------------|
| node1 | 192.168.56.101 (eth1) | 10.0.2.1 (eth0)    |
| node2 | 192.168.56.102 (eth1) | 10.0.2.2 (eth0)    |
| node3 | 192.168.56.103 (eth1) | 10.0.2.3 (eth0)    |
From node1, test the network connection to the other two nodes:

node1> ping 192.168.56.102 -I eth1
node1> ping 192.168.56.103 -I eth1
And then do the same in the other direction.
In order to avoid conflicts, it is a good idea to remove any MySQL 5.1 packages that might be installed:
rpm -e mysql-libs-5.1.61-1.el6_2.1.x86_64
Subsequently, install the relevant MySQL 5.5 packages [1]:
rpm -i MySQL-shared-compat-5.5.22-2.el6.x86_64.rpm
rpm -i MySQL-client-5.5.22-1.el6.x86_64.rpm
With CentOS 6, the following additional package is necessary:
yum install openssl098e.x86_64
The easiest way to obtain the Galera binaries is through the Codership website [2]. You will need both the patched MySQL binaries and the Galera replication plugin. To get them, enter the following commands:
wget https://launchpad.net/codership-mysql/5.5/5.5.20-23.4/+download/MySQL-server-5.5.20_wsrep_23.4-1.rhel5.x86_64.rpm
wget https://launchpad.net/galera/2.x/23.2.0/+download/galera-23.2.0-1.rhel5.x86_64.rpm
rpm -i MySQL-server-5.5.20_wsrep_23.4-1.rhel5.x86_64.rpm
rpm -i galera-23.2.0-1.rhel5.x86_64.rpm
The next step is to start MySQL and make sure the database runs correctly by entering the start command:
/etc/init.d/mysql start
Starting MySQL.. SUCCESS!
In Listing 1, you can see that MySQL is running on all nodes.
Listing 1: MySQL Is Running
mysql> status;
--------------
mysql  Ver 14.14 Distrib 5.5.22, for Linux (x86_64) using readline 5.1
...
Server version:     5.5.20 MySQL Community Server (GPL), wsrep_23.4.r3713
Protocol version:   10
Connection:         Localhost via UNIX socket
The command in Listing 2 shows that Galera replication is not yet ready.
Listing 2: No Replication
SHOW GLOBAL VARIABLES LIKE 'wsrep_pro%';
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| wsrep_provider         | none  |
| wsrep_provider_options |       |
+------------------------+-------+
Each node that wants to join the cluster must import the initial data set from the cluster. As mentioned earlier, this process is called a state snapshot transfer (SST).
A user account is necessary to perform an SST, and it is best not to use the root MySQL user; therefore, you will need to set up a new user on all three nodes as follows:
GRANT ALL PRIVILEGES ON *.* TO 'sst'@'localhost' IDENTIFIED BY 'secret';
GRANT ALL PRIVILEGES ON *.* TO 'sst'@'%' IDENTIFIED BY 'secret';
Configuring the Galera Cluster
As usual in MySQL, the Galera Cluster is configured in the /etc/my.cnf file. The parameters that affect Galera begin with wsrep_. Some additional MySQL parameters also must be taken into account. Enter the parameters in Listing 3 in the my.cnf file under the [mysqld] section.
Listing 3: MySQL Parameters
#
# Galera-specific MySQL Parameters
#
binlog_format                  = row
default_storage_engine         = InnoDB
innodb_autoinc_lock_mode       = 2
innodb_locks_unsafe_for_binlog = 1
query_cache_size               = 0
query_cache_type               = 0
For the Galera Cluster, the row-based replication format (RBR), which was added in MySQL 5.1, is mandatory. As mentioned earlier, Galera Cluster only runs with InnoDB tables. Because the query cache runs locally for each node and the caches do not communicate their content with each other, it must be deactivated to ensure data consistency.
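Once MySQL is running with these settings, a quick sanity check confirms that they are active, for example:

SHOW GLOBAL VARIABLES LIKE 'binlog_format';
SHOW GLOBAL VARIABLES LIKE 'default_storage_engine';
SHOW GLOBAL VARIABLES LIKE 'query_cache_type';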
The minimal Galera Cluster configuration consists of the parameters shown in Listing 4, which also belong in the [mysqld] section.
Listing 4: Galera Parameters
#
# WSREP (Galera) Parameters
#
#wsrep_provider              = none
wsrep_provider               = /usr/lib64/galera/libgalera_smm.so
wsrep_cluster_name           = "Galera-2.0 wsrep-23.4"
wsrep_node_name              = "Node 1"
wsrep_node_address           = 192.168.56.101
wsrep_node_incoming_address  = 10.0.2.1
#wsrep_cluster_address       = "gcomm://192.168.56.103"
wsrep_cluster_address        = "gcomm://"
#
# WSREP State Snapshot Transfer Parameters
#
wsrep_sst_method             = mysqldump
wsrep_sst_auth               = sst:secret
The parameter wsrep_provider in Listing 4 tells MySQL where to find the Galera plugin. wsrep_cluster_name gives the cluster a unique name in the network. This parameter serves, above all, to prevent a completely foreign node from accidentally joining the cluster. wsrep_cluster_name must be the same on all nodes of the cluster.
The parameter wsrep_node_name gives a node a unique name. This parameter must be different for each of the three nodes. The parameter wsrep_cluster_address tells the node where it can find one of its peers, so it can be integrated into the cluster. When the first node starts, naturally, no other node is available, so the entry must read gcomm://. All other nodes can refer to this node when starting. However, if the first node restarts, it would attempt to build a new cluster with gcomm://. Therefore, after the first startup of this node, wsrep_cluster_address must be set to another node.
The following procedure is recommended when first starting the Galera Cluster:
node1: gcomm://
node2: gcomm://192.168.56.101
node3: gcomm://192.168.56.102
Subsequently, the my.cnf from node1 needs to be set accordingly, as follows:
node1: gcomm://192.168.56.103
Now you can start and stop each node at will. But when the cluster needs to be newly initialized, the first node would again need to be started with gcomm://.
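When the whole cluster has been shut down and needs to be initialized anew, you want to bootstrap from the node with the most recent data. In the Galera versions I have worked with, each node records its last known state in a grastate.dat file in the data directory, so a rough check before choosing the bootstrap node looks like this (the path matches the data directory used in this setup):

cat /var/lib/mysql/grastate.dat

The node showing the highest sequence number is the one that should be started with gcomm:// first.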
The parameters wsrep_sst_method and wsrep_sst_auth tell the Galera Cluster which method it should use for an SST and which username and password grant access for it.
If more than one interface is used, wsrep_sst_receive_address should refer to the interface over which the SST should be received (and sent). Otherwise, it could happen that Galera guesses the wrong interface and the SST fails.
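For the addressing scheme in Table 1, that would look something like the following on node1 (an illustrative value, not part of Listing 4):

wsrep_sst_receive_address = 192.168.56.101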
After all three nodes have been stopped, activate wsrep_provider in all three my.cnf files and start the first node:
/etc/init.d/mysql start
To be on the safe side, the MySQL error log under /var/lib/mysql/node1.err should be checked for any problems. When error messages like those in Listing 5 come up, SELinux is turned on. It would be better to turn it off, temporarily with
Listing 5: Error Messages
sh: /sbin/ifconfig: Permission denied
120405 8:29:02 [ERROR] WSREP: Failed to read output of: '/sbin/ifconfig | grep -m1 -1 -E '^[a-z]?eth[0-9]' | tail -n 1 | awk '{ print $2 }' | awk -F : '{ print $2 }''
...
120405 8:29:02 [ERROR] WSREP: Permission denied
120405 8:29:02 [ERROR] WSREP: failed to open gcomm backend connection: 13: error while trying to listen 'tcp://0.0.0.0:4567?socket.non_blocking=1', asio error 'Permission denied': 13 (Permission denied) at gcomm/src/asio_tcp.cpp:listen():750
120405 8:29:02 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():195: Failed to open backend connection: -13 (Permission denied)
120405 8:29:02 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1284: Failed to open channel 'Galera-2.0 wsrep-23.4' at 'gcomm://': -13 (Permission denied)
120405 8:29:02 [ERROR] WSREP: gcs connect failed: Permission denied
120405 8:29:02 [ERROR] WSREP: wsrep::connect() failed: 6
120405 8:29:02 [ERROR] Aborting
120405 8:29:02 [Note] WSREP: Service disconnected.
120405 8:29:03 [Note] WSREP: Some threads may fail to exit.
echo 0 >/selinux/enforce
and permanently in the /etc/selinux/config
file:
SELINUX=permissive
Once the Galera node has started successfully, you will see output like that shown in Listing 6.
Listing 6: Successful Start
mysql> SHOW GLOBAL STATUS LIKE 'wsrep%';
+----------------------------+--------------------------------------+
| Variable_name              | Value                                |
+----------------------------+--------------------------------------+
| wsrep_local_state          | 4                                    |
| wsrep_local_state_comment  | Synced (6)                           |
| wsrep_cluster_conf_id      | 1                                    |
| wsrep_cluster_size         | 1                                    |
| wsrep_cluster_status       | Primary                              |
| wsrep_connected            | ON                                   |
| wsrep_local_index          | 0                                    |
| wsrep_ready                | ON                                   |
+----------------------------+--------------------------------------+
The node is connected, ready, and synced. The cluster size is currently 1, and the local ID (index) is 0. Now you can start the second node. For a short time, the status of node1 will appear as Donor:
| wsrep_local_state_comment | Donor (+) |
This designation means that it has been chosen to transmit the data to node2 for the SST. Once all nodes have been started successfully, the status information will appear as in Listing 7.
Listing 7: Cluster Is Running
mysql> SHOW GLOBAL STATUS LIKE 'wsrep_cluster_%';
+----------------------------+--------------------------------------+
| Variable_name              | Value                                |
+----------------------------+--------------------------------------+
| wsrep_cluster_conf_id      | 3                                    |
| wsrep_cluster_size         | 3                                    |
+----------------------------+--------------------------------------+
Afterward, don't forget to change the wsrep_cluster_address entry on the first node:
wsrep_cluster_address = "gcomm://192.168.56.103"
#wsrep_cluster_address = "gcomm://"
Otherwise, upon a restart of node1, the node will not be able to find the cluster and will try to build a new cluster of its own.
With Load Balancers
From the back-end perspective of the database, a high-availability database cluster now exists, but from the application perspective, things appear quite different.
Assuming the application connects to node2 and that node then breaks down, the application must automatically connect to another node in the cluster. This is best done with a load balancer. Several possibilities exist for load balancing:
- You can build the load-balancing mechanism directly into the application.
- With Java or PHP applications, you can use the failover functionality of the corresponding connector (Connector/J [3], mysqlnd-ms [4]), as sketched after this list.
- If you can't make changes in the application – or you don't want to – you can put a load balancer between the nodes and the application by using either a hardware load balancer or a software load balancer (e.g., Pen [5], GLB [6], HAProxy [7], LVS [8], or MySQL Proxy [9]).
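As a sketch of the connector approach, a Connector/J connection URL can list all three nodes so the driver fails over between them; the database name mydb and the failOverReadOnly property are assumptions here, so check the Connector/J documentation for the exact syntax of your driver version:

jdbc:mysql://192.168.56.101:3306,192.168.56.102:3306,192.168.56.103:3306/mydb?failOverReadOnly=false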
You can locate the load balancer together with the application (on the application server) or, in larger setups, on a central server of its own (Figure 2).
If you need to manage a large number of applications or servers, it is a good idea to follow the second approach. But if you only have a few applications, less effort (and less hardware) is needed for installing the load balancer and application on the same machine.
Galera Load Balancer
One option for load balancing is the Galera Load Balancer (GLB), which you can install as follows:
wget http://www.codership.com/files/glb/glb-0.7.4.tar.gz
tar xf glb-0.7.4.tar.gz
cd glb-0.7.4
./configure
make
make install
To start GLB, follow the steps shown in Listing 8. Then, you can query the GLB, as shown in Listing 9. Listing 10 shows how to remove node 192.168.56.101 for maintenance.
Listing 8: Starting the Load Balancer
./glbd --daemon --threads 6 --control 127.0.0.1:4444 127.0.0.1:3306 \
  192.168.56.101:3306:1 192.168.56.102:3306:1 192.168.56.103:3306:1
Incoming address: 127.0.0.1:3306, control FIFO: /tmp/glbd.fifo
Control address: 127.0.0.1:4444
Number of threads: 6, source tracking: OFF, verbose: OFF, daemon: YES
Destinations: 3
   0: 192.168.56.101:3306, w: 1.000
   1: 192.168.56.102:3306, w: 1.000
   2: 192.168.56.103:3306, w: 1.000
Listing 9: Querying the Load Balancer
echo getinfo | nc -q 1 127.0.0.1 4444
Router:
----------------------------------------------------
        Address       : weight   usage  conns
 192.168.56.101:3306  :  1.000   0.667      2
 192.168.56.102:3306  :  1.000   0.500      1
 192.168.56.103:3306  :  1.000   0.500      1
----------------------------------------------------
Destinations: 3, total connections: 4

echo getstats | nc -q 1 127.0.0.1 4444
in: 37349 out: 52598 recv: 89947 / 1989 send: 89947 / 1768 conns: 225 / 4 poll: 1989 / 0 / 1989 elapsed: 76.59987
Listing 10: Removing a Node
echo 192.168.56.101:3306:0 | nc -q 1 127.0.0.1 4444
echo getinfo | nc -q 1 127.0.0.1 4444
Router:
----------------------------------------------------
        Address       : weight   usage  conns
 192.168.56.101:3306  :  0.000   1.000      0
 192.168.56.102:3306  :  1.000   0.667      2
 192.168.56.103:3306  :  1.000   0.667      2
----------------------------------------------------
Destinations: 3, total connections: 4
In general, you can remove and add nodes with the following two commands:
echo 192.168.56.103:3306:-1 | nc -q 1 127.0.0.1 4444
echo 192.168.56.103:3306:2 | nc -q 1 127.0.0.1 4444
I experienced some inconsistencies while trying out Galera Load Balancer and reported the problems to the developer.
Pen
Because I encountered problems with GLB, I took a look at the TCP load balancer Pen. Listing 11 shows how to install the Pen load balancer, and Listing 12 shows a query to Pen.
Listing 11: Installing and Starting Pen
wget http://siag.nu/pub/pen/pen-0.18.0.tar.gz
tar xf /download/pen-0.18.0.tar.gz
cd pen-0.18.0/
./configure
make
make install

Started with:

./pen -r -p pen.pid -C localhost:4444 -S 3 \
  localhost:3306 192.168.56.101:3306:100 \
  192.168.56.102:3306:100 192.168.56.103:3306:100
Listing 12: Querying the Pen Load Balancer
./penctl localhost:4444 mode
no block no delayed_forward no hash roundrobin no stubborn weight no prio

./penctl localhost:4444 servers
0 addr 192.168.56.101 port 3306 conn 1 max 100 hard 0 weight 0 prio 0 sx 1245356 rx 1753808
1 addr 192.168.56.102 port 3306 conn 1 max 100 hard 0 weight 0 prio 0 sx 1245525 rx 1754046
2 addr 192.168.56.103 port 3306 conn 1 max 100 hard 0 weight 0 prio 0 sx 1245525 rx 1754046
Pen can save its status in an HTML file, which you can open with a browser or integrate in your monitoring infrastructure.
./penctl localhost:4444 status > /tmp/pen.html
New back-end servers cannot be added while Pen is running. The Pen developer recommends preparing several extra servers and then activating them when they are needed. For this, you must stop Pen correctly:
kill -TERM `cat pen.pid`
If a back-end server or Galera node breaks down completely, Pen will register it and reroute traffic accordingly. If a Galera node is performing an SST (i.e., not available for the application) or if maintenance work needs to be done on the respective node, deactivate the node in Pen by putting it on a blacklist for a specific period of time:
./penctl localhost:4444 server 0 blacklist 7200
Conclusion
The Galera Cluster is now up and running. If the load balancer is not inserted directly in front of the application, you should also set up the load balancer for high availability. You can achieve this with vrrpd [10] or other common cluster solutions.