Galera Cluster for MySQL
Cluster Time
Why does the world need yet another MySQL clustering solution when several diverse approaches for increasing the availability of MySQL databases already exist? Because the traditional answers to the problem of clustering MySQL and replicating changes have significant limitations. For instance, the master/slave replication model can lead to delays that cause data loss, and active/passive failover solutions introduce significant complexity that can lead to misconfiguration and other problems. (See the box titled "Alternatives.")
Galera Cluster is designed to avoid the issues associated with other MySQL clustering alternatives. Galera offers an uncomplicated solution for achieving high availability and scaling read requests.
Introducing Galera Cluster
Galera Cluster is a synchronous-replication cluster based on the InnoDB storage engine. Galera uses a true active/active multimaster topology, which means applications can read from and write to all nodes simultaneously. If one node breaks down, the other nodes register this automatically. When a new node joins the cluster, it is automatically integrated and synchronized.
In contrast to master/slave replication solutions, Galera Cluster replicates in parallel at the row level, thus achieving a much faster data flow than MySQL master/slave replication. Write access is scalable to a certain extent; read access scales with the number of cluster nodes. In contrast to MySQL replication, nodes cannot lag behind, so there is no need to check the data for currency, and transactions cannot be lost.
The underlying dependence on the InnoDB storage engine means that all tables must be available as InnoDB tables. If for some reason the tables are not available in InnoDB format, Galera is only of limited usefulness, but usually tables can be converted to InnoDB without any problems.
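As a quick check, a sketch along the following lines lists any tables that are not stored in InnoDB and converts one of them (the schema and table names in mydb.mytable are placeholders, not part of the setup described here):

SELECT table_schema, table_name, engine
  FROM information_schema.tables
 WHERE engine IS NOT NULL
   AND engine <> 'InnoDB'
   AND table_schema NOT IN ('mysql', 'information_schema', 'performance_schema');

ALTER TABLE mydb.mytable ENGINE = InnoDB;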
Much Light – But Also Shadow?
With so many advantages, the question naturally arises: Where's the catch? Galera Cluster does indeed bring some disadvantages. Because MySQL does not offer all the necessary functions itself, the MySQL code must be patched. The replication itself is handled by a plugin. Fortunately, the user can resort to ready-made binaries provided by the company Codership. Alternatively, binaries from Percona or MariaDB can be used.
A further disadvantage of Galera Cluster is that an increased number of conflicts arise in the case of hotspots (small tables with a very high rate of change). In this scenario, the application increasingly receives deadlock messages and must deal with them. One last disadvantage is that, during the initial complete synchronization, when a node is first integrated into the cluster, the node that provides the data (the so-called donor node) is not available for any other access. That is also one of the main reasons Galera Cluster should always be configured with at least three nodes (Figure 1).
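To get a feel for how often such conflicts actually occur, you can watch Galera's conflict counters, for example (the exact variable names may vary slightly between wsrep versions):

SHOW GLOBAL STATUS LIKE 'wsrep_local_cert_failures';
SHOW GLOBAL STATUS LIKE 'wsrep_local_bf_aborts';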
Topology and Function
The recommended setup for Galera Cluster consists of three nodes (servers). These server nodes can be either physical or virtual. (The servers can even be spread out over different data centers, although response times will understandably suffer if the servers are far apart.) To achieve high availability for the whole system, a load balancer is added between the application and the Galera Cluster.
The patches from Codership intervene in the transaction and replication mechanism of MySQL. When a COMMIT command occurs, the write-set (i.e., the transaction plus metadata) from the so-called master node (the node that receives the transaction) is transmitted to all other nodes (the slaves) in the cluster. The cluster then generates a global transaction ID (GTID). With it, each node (including the master) can validate the write-set to determine whether it can be applied or not. If the validation is successful, the write-set is applied. Otherwise, the slave rejects the write-set and the master performs a rollback.
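As an illustration of what such a conflict looks like to the application, suppose two nodes change the same row at the same time; only one of the two commits survives validation. The table name accounts is a placeholder, and the exact error text depends on the MySQL/wsrep version:

-- On node1:
BEGIN;
UPDATE accounts SET balance = balance - 10 WHERE id = 1;
COMMIT;

-- At the same time on node2:
BEGIN;
UPDATE accounts SET balance = balance + 10 WHERE id = 1;
COMMIT;
-- One of the two COMMIT statements fails, typically with:
-- ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction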
When the slave node cannot apply the write-set (e.g., because the disk is full), it is excluded from the cluster and must terminate itself to guarantee that no data inconsistencies arise. Upon re-entry into the cluster, it must synchronize itself with the other nodes with a so-called state snapshot transfer (SST) to achieve a consistent data state. The SST is a complete transmission of the data from a so-called donor node to the receiving node newly joining the cluster. During an SST, the donor is not available for SQL queries. This is one reason why a Galera setup with only two nodes is not recommended.
An SST can be accomplished with mysqldump, rsync, or XtraBackup (2.0 and up). With a very large data volume (as a rule of thumb: data volume > RAM), mysqldump is no longer a suitable SST method because of the long processing time.
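In that case, it makes sense to switch the SST method in my.cnf; a sketch of the corresponding entry, here using rsync (xtrabackup would be configured analogously):

wsrep_sst_method = rsync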
Galera Cluster 2.0 includes the so-called incremental state transfer (IST). IST makes it possible for the cluster to transmit and apply only the missing increments on restart, which naturally is much quicker. This feature is especially interesting when the Galera Cluster is deployed in a WAN, where transmitting the complete data volume would take forever because of the limited bandwidth.
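Whether an IST is possible depends on whether the missing write-sets are still in the donor's write-set cache. As a sketch, the cache size can be raised through the provider options in my.cnf; gcache.size is, to my knowledge, the relevant option, but check the documentation of your Galera version:

wsrep_provider_options = "gcache.size=1G"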
Installation
Galera Cluster runs flawlessly both on physical machines and on virtual systems. The recommended setup consists of three nodes. Accordingly, an administrator who wants to test the cluster can start three Linux nodes on any suitable virtualization solution. Ideally, each node will have two interfaces: one for internal communication within the cluster and one for communication with the applications. The following example uses CentOS 6.2 with the newest updates.
First, the installer should make certain that all the nodes of the future cluster can reach one another. The IP addresses for the following configuration are shown in Table 1.
Table 1: IP Addresses

| Node  | Internal Interface    | External Interface |
|-------|-----------------------|--------------------|
| node1 | 192.168.56.101 (eth1) | 10.0.2.1 (eth0)    |
| node2 | 192.168.56.102 (eth1) | 10.0.2.2 (eth0)    |
| node3 | 192.168.56.103 (eth1) | 10.0.2.3 (eth0)    |
From node1, test the network connection to the other two nodes:

node1> ping 192.168.56.102 -I eth1
node1> ping 192.168.56.103 -I eth1
And then do the same in the other direction.
In order to avoid conflicts, it is a good idea to remove any MySQL 5.1 packages that might be installed:
rpm -e mysql-libs-5.1.61-1.el6_2.1.x86_64
Subsequently, install the relevant MySQL 5.5 packages [1]:
rpm -i MySQL-shared-compat-5.5.22-2.el6.x86_64.rpm
rpm -i MySQL-client-5.5.22-1.el6.x86_64.rpm
With CentOS 6, the following additional package is necessary:
yum install openssl098e.x86_64
The easiest way to obtain the Galera binaries is through the Codership website [2]. You will need both the patched MySQL binaries and the Galera replication plugin. To get them, enter the following commands:
wget https://launchpad.net/codership-mysql/5.5/5.5.20-23.4/+download/MySQL-server-5.5.20_wsrep_23.4-1.rhel5.x86_64.rpm
wget https://launchpad.net/galera/2.x/23.2.0/+download/galera-23.2.0-1.rhel5.x86_64.rpm
rpm -i MySQL-server-5.5.20_wsrep_23.4-1.rhel5.x86_64.rpm
rpm -i galera-23.2.0-1.rhel5.x86_64.rpm
The next step is to start MySQL and make sure the database runs correctly by entering the start command:
/etc/init.d/mysql start
Starting MySQL.. SUCCESS!
In Listing 1, you can see that MySQL is running on all nodes.
Listing 1: MySQL Is Running
mysql> status;
--------------
mysql  Ver 14.14 Distrib 5.5.22, for Linux (x86_64) using readline 5.1
...
Server version:     5.5.20 MySQL Community Server (GPL), wsrep_23.4.r3713
Protocol version:   10
Connection:         Localhost via UNIX socket
The command in Listing 2 shows that Galera replication is not yet ready.
Listing 2: No Replication
SHOW GLOBAL VARIABLES LIKE 'wsrep_pro%';
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| wsrep_provider         | none  |
| wsrep_provider_options |       |
+------------------------+-------+
Each node that wants to join the cluster must import the initial data set from the cluster. As mentioned earlier, this process is called a state snapshot transfer (SST).
A user account is necessary to perform an SST, and it is best not to use the root MySQL user; therefore, you will need to set up a new user on all three nodes as follows:
GRANT ALL PRIVILEGES ON *.* TO 'sst'@'localhost' IDENTIFIED BY 'secret';
GRANT ALL PRIVILEGES ON *.* TO 'sst'@'%' IDENTIFIED BY 'secret';
Configuring the Galera Cluster
As usual in MySQL, the Galera Cluster is configured in the /etc/my.cnf file. The parameters that affect Galera begin with wsrep_. Some additional MySQL parameters also must be taken into account. Enter the parameters in Listing 3 in the my.cnf file under the [mysqld] section.
Listing 3: MySQL Parameters
#
# Galera-specific MySQL Parameters
#
binlog_format                  = row
default_storage_engine         = InnoDB
innodb_autoinc_lock_mode       = 2
innodb_locks_unsafe_for_binlog = 1
query_cache_size               = 0
query_cache_type               = 0
For the Galera Cluster, the row-based replication format (RBR), which was added in MySQL 5.1, is mandatory. As mentioned earlier, Galera Cluster only runs with InnoDB tables. Because the query cache runs locally for each node and the caches do not communicate their content with each other, it must be deactivated to ensure data consistency.
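Once MySQL is running with these settings, a quick sanity check confirms that they are active, for example:

SHOW GLOBAL VARIABLES LIKE 'binlog_format';
SHOW GLOBAL VARIABLES LIKE 'default_storage_engine';
SHOW GLOBAL VARIABLES LIKE 'query_cache_type';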
The minimal Galera Cluster configuration consists of the parameters shown in Listing 4, which also belong in the [mysqld] section.
Listing 4: Galera Parameters
#
# WSREP (Galera) Parameters
#
#wsrep_provider              = none
wsrep_provider               = /usr/lib64/galera/libgalera_smm.so
wsrep_cluster_name           = "Galera-2.0 wsrep-23.4"
wsrep_node_name              = "Node 1"
wsrep_node_address           = 192.168.56.101
wsrep_node_incoming_address  = 10.0.2.1
#wsrep_cluster_address       = "gcomm://192.168.56.103"
wsrep_cluster_address        = "gcomm://"
#
# WSREP State Snapshot Transfer Parameters
#
wsrep_sst_method             = mysqldump
wsrep_sst_auth               = sst:secret
The parameter wsrep_provider in Listing 4 tells MySQL where to find the Galera plugin. wsrep_cluster_name gives the cluster a unique name in the network. This parameter serves, above all, to prevent a completely foreign node from accidentally joining the cluster. wsrep_cluster_name must be the same on all nodes of the cluster.
The parameter wsrep_node_name gives a node a unique name. This parameter must be different for each of the three nodes. The parameter wsrep_cluster_address tells the node where it can find one of its peers, so it can be integrated into the cluster. When the first node starts, naturally, no other node is available, so the entry must read gcomm://. All other nodes can refer to this node when starting. However, if the first node restarts, it would attempt to build a new cluster with gcomm://. Therefore, after the first startup of this node, wsrep_cluster_address must be set to another node.
The following procedure is recommended when first starting the Galera Cluster:
node1: gcomm://
node2: gcomm://192.168.56.101
node3: gcomm://192.168.56.102
Subsequently, the my.cnf from node1 needs to be set accordingly, as follows:
node1: gcomm://192.168.56.103
Now you can start and stop each node at will. But when the cluster needs to be newly initialized, the first node would again need to be started with gcomm://.
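When the whole cluster has been shut down and needs to be initialized anew, you want to bootstrap from the node with the most recent data. In the Galera versions I have worked with, each node records its last known state in a grastate.dat file in the data directory, so a rough check before choosing the bootstrap node looks like this (the path matches the data directory used in this setup):

cat /var/lib/mysql/grastate.dat

The node showing the highest sequence number is the one that should be started with gcomm:// first.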
The parameters wsrep_sst_method and wsrep_sst_auth tell the Galera Cluster which method it should use for an SST and which username and password grant access for it.
If more than one interface is used, wsrep_sst_receive_address should refer to the interface over which the SST should be received (and sent). Otherwise, it could happen that Galera guesses the wrong interface and the SST fails.
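For the addressing scheme in Table 1, that would look something like the following on node1 (an illustrative value, not part of Listing 4):

wsrep_sst_receive_address = 192.168.56.101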
After all three nodes have been stopped, activate wsrep_provider in all three my.cnf files and start the first node:
/etc/init.d/mysql start
To be on the safe side, the MySQL error log under /var/lib/mysql/node1.err should be checked for any problems. When error messages like those in Listing 5 come up, SELinux is turned on. It would be better to turn it off, temporarily with
Listing 5: Error Messages
sh: /sbin/ifconfig: Permission denied
120405 8:29:02 [ERROR] WSREP: Failed to read output of: '/sbin/ifconfig | grep -m1 -1 -E '^[a-z]?eth[0-9]' | tail -n 1 | awk '{ print $2 }' | awk -F : '{ print $2 }''
...
120405 8:29:02 [ERROR] WSREP: Permission denied
120405 8:29:02 [ERROR] WSREP: failed to open gcomm backend connection: 13: error while trying to listen 'tcp://0.0.0.0:4567?socket.non_blocking=1', asio error 'Permission denied': 13 (Permission denied) at gcomm/src/asio_tcp.cpp:listen():750
120405 8:29:02 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():195: Failed to open backend connection: -13 (Permission denied)
120405 8:29:02 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1284: Failed to open channel 'Galera-2.0 wsrep-23.4' at 'gcomm://': -13 (Permission denied)
120405 8:29:02 [ERROR] WSREP: gcs connect failed: Permission denied
120405 8:29:02 [ERROR] WSREP: wsrep::connect() failed: 6
120405 8:29:02 [ERROR] Aborting
120405 8:29:02 [Note] WSREP: Service disconnected.
120405 8:29:03 [Note] WSREP: Some threads may fail to exit.
echo 0 >/selinux/enforce
and permanently in the /etc/selinux/config
file:
SELINUX=permissive
Once the Galera node has started successfully, you will see output like that shown in Listing 6.
Listing 6: Successful Start
mysql> SHOW GLOBAL STATUS LIKE 'wsrep%';
+----------------------------+--------------------------------------+
| Variable_name              | Value                                |
+----------------------------+--------------------------------------+
| wsrep_local_state          | 4                                    |
| wsrep_local_state_comment  | Synced (6)                           |
| wsrep_cluster_conf_id      | 1                                    |
| wsrep_cluster_size         | 1                                    |
| wsrep_cluster_status       | Primary                              |
| wsrep_connected            | ON                                   |
| wsrep_local_index          | 0                                    |
| wsrep_ready                | ON                                   |
+----------------------------+--------------------------------------+
The node is connected, ready, and synced. The cluster size is currently 1, and the local ID (index) is 0. Now you can start the second node. For a short time, the status of node1 will appear as Donor:
| wsrep_local_state_comment | Donor (+) |
This designation means that it has been chosen to transmit the data to node2 for the SST. Once all nodes have been started successfully, the status information will appear as in Listing 7.
Listing 7: Cluster Is Running
mysql> SHOW GLOBAL STATUS LIKE 'wsrep_cluster_%';
+----------------------------+--------------------------------------+
| Variable_name              | Value                                |
+----------------------------+--------------------------------------+
| wsrep_cluster_conf_id      | 3                                    |
| wsrep_cluster_size         | 3                                    |
+----------------------------+--------------------------------------+
Afterward, don't forget to change the wsrep_cluster_address entry on the first node:
wsrep_cluster_address = "gcomm://192.168.56.103"
#wsrep_cluster_address = "gcomm://"
Otherwise, upon a restart of node1, the node will not be able to find the cluster and will try to build a new cluster of its own.
With Load Balancers
From the back-end perspective of the database, a high-availability database cluster now exists, but from the application perspective, things appear quite different.
Assuming the application connects to node2 and that node then breaks down, the application must automatically connect to another node in the cluster. This is best done with a load balancer. Several possibilities exist for load balancing:
- You can build the load-balancing mechanism directly into the application.
- With Java or PHP applications, you can use the failover functionality of the corresponding connector (Connector/J [3], mysqlnd-ms [4]), as sketched after this list.
- If you can't make changes in the application – or you don't want to – you can put a load balancer between the nodes and the application by using either a hardware load balancer or a software load balancer (e.g., Pen [5], GLB [6], HAProxy [7], LVS [8], or MySQL Proxy [9]).
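As a sketch of the connector approach, a Connector/J connection URL can list all three nodes so the driver fails over between them; the database name mydb and the failOverReadOnly property are assumptions here, so check the Connector/J documentation for the exact syntax of your driver version:

jdbc:mysql://192.168.56.101:3306,192.168.56.102:3306,192.168.56.103:3306/mydb?failOverReadOnly=false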
You can locate the load balancer together with the application (on the application server) or, in larger setups, on a central server of its own (Figure 2).
If you need to manage a large number of applications or servers, it is a good idea to follow the second approach. But if you only have a few applications, less effort (and less hardware) is needed for installing the load balancer and application on the same machine.
Galera Load Balancer
One option for load balancing is the Galera Load Balancer (GLB), which you can install as follows:
wget http://www.codership.com/files/glb/glb-0.7.4.tar.gz
tar xf glb-0.7.4.tar.gz
cd glb-0.7.4
./configure
make
make install
To start GLB, follow the steps shown in Listing 8. Then, you can query the GLB, as shown in Listing 9. Listing 10 shows how to remove node 192.168.56.101 for maintenance.
Listing 8: Starting the Load Balancer
./glbd --daemon --threads 6 --control 127.0.0.1:4444 127.0.0.1:3306 \
  192.168.56.101:3306:1 192.168.56.102:3306:1 192.168.56.103:3306:1
Incoming address: 127.0.0.1:3306, control FIFO: /tmp/glbd.fifo
Control address: 127.0.0.1:4444
Number of threads: 6, source tracking: OFF, verbose: OFF, daemon: YES
Destinations: 3
   0: 192.168.56.101:3306, w: 1.000
   1: 192.168.56.102:3306, w: 1.000
   2: 192.168.56.103:3306, w: 1.000
Listing 9: Querying the Load Balancer
echo getinfo | nc -q 1 127.0.0.1 4444
Router:
----------------------------------------------------
        Address       : weight   usage  conns
 192.168.56.101:3306  :  1.000   0.667      2
 192.168.56.102:3306  :  1.000   0.500      1
 192.168.56.103:3306  :  1.000   0.500      1
----------------------------------------------------
Destinations: 3, total connections: 4

echo getstats | nc -q 1 127.0.0.1 4444
in: 37349 out: 52598 recv: 89947 / 1989 send: 89947 / 1768 conns: 225 / 4 poll: 1989 / 0 / 1989 elapsed: 76.59987
Listing 10: Removing a Node
echo 192.168.56.101:3306:0 | nc -q 1 127.0.0.1 4444
echo getinfo | nc -q 1 127.0.0.1 4444
Router:
----------------------------------------------------
        Address       : weight   usage  conns
 192.168.56.101:3306  :  0.000   1.000      0
 192.168.56.102:3306  :  1.000   0.667      2
 192.168.56.103:3306  :  1.000   0.667      2
----------------------------------------------------
Destinations: 3, total connections: 4
In general, you can remove and add nodes with the following two commands:
echo 192.168.56.103:3306:-1 | nc -q 1 127.0.0.1 4444
echo 192.168.56.103:3306:2 | nc -q 1 127.0.0.1 4444
I experienced some inconsistencies while trying out Galera Load Balancer and reported the problems to the developer.
Pen
Because I encountered problems with GLB, I took a look at the TCP load balancer Pen. Listing 11 shows how to install the Pen load balancer, and Listing 12 shows a query to Pen.
Listing 11: Installing and Starting Pen
wget http://siag.nu/pub/pen/pen-0.18.0.tar.gz
tar xf /download/pen-0.18.0.tar.gz
cd pen-0.18.0/
./configure
make
make install

Started with:

./pen -r -p pen.pid -C localhost:4444 -S 3 \
  localhost:3306 192.168.56.101:3306:100 \
  192.168.56.102:3306:100 192.168.56.103:3306:100
Listing 12: Querying the Pen Load Balancer
./penctl localhost:4444 mode
no block no delayed_forward no hash roundrobin no stubborn weight no prio

./penctl localhost:4444 servers
0 addr 192.168.56.101 port 3306 conn 1 max 100 hard 0 weight 0 prio 0 sx 1245356 rx 1753808
1 addr 192.168.56.102 port 3306 conn 1 max 100 hard 0 weight 0 prio 0 sx 1245525 rx 1754046
2 addr 192.168.56.103 port 3306 conn 1 max 100 hard 0 weight 0 prio 0 sx 1245525 rx 1754046
Pen can save its status in an HTML file, which you can open with a browser or integrate in your monitoring infrastructure.
./penctl localhost:4444 status > /tmp/pen.html
New back-end servers cannot be added while Pen is running. The Pen developer recommends preparing several extra servers and then activating them when they are needed. For this, you must stop Pen correctly:
kill -TERM `cat pen.pid`
If a back-end server or Galera node breaks down completely, Pen will register it and reroute traffic accordingly. If a Galera node is performing an SST (i.e., not available for the application) or if maintenance work needs to be done on the respective node, deactivate the node in Pen by putting it on a blacklist for a specific period of time:
./penctl localhost:4444 server 0 blacklist 7200
Conclusion
The Galera Cluster is now up and running. If the load balancer is not inserted directly in front of the application, you should also set up the load balancer for high availability. You can achieve this with vrrpd [10] or other common cluster solutions.