
Percona Galera MySQL Cluster


Why a MySQL cluster

This post describes how to set up a Percona Galera MySQL Cluster with synchronous replication and load balancing. I have been using MySQL for many years as the RDBMS of choice for most applications; it does have its quirks, and it may lack some features I wish it had (and that other relational databases such as PostgreSQL have), but generally speaking it works fairly well and has good performance and reliability; also, I am in any case more familiar with it than with other databases. While these days the buzz is mainly about the so-called NoSQL/schemaless/document-store/key-value-store/you-name-it alternatives, in my opinion relational databases are still a good option in many cases and are often also easier to use.

For a while the typical solution I used to rely on to scale MySQL databases was based on asynchronous replication, partitioning, and sharding, depending on the case. However, I got tired of slaves going out of sync, and sharding can be a very good or a very bad idea depending on how it is done and on how well you can guess and plan in advance how the data will be used. In the past I’ve also tried the ‘standard’ MySQL Cluster, multi-master replication setups, and various proxying configurations; however none of these were 100% reliable or easy enough to set up and maintain. About a year ago I started playing with a different type of MySQL cluster based on the synchronous replication provided by the Galera plugin (by Codership – who are also based here in Finland); Galera enables virtually synchronous replication, allowing reads from and writes to any node, and it automatically handles node provisioning. Better performance than the ‘standard’ MySQL Cluster, no more slaves out of sync, true multi-master replication and scalability out of the box with very little maintenance. Yay!

Rather than using the plugin directly with the standard ‘distribution’ of MySQL, I prefer using Percona‘s own distribution, which includes many optimisations as well as the XtraDB storage engine, a drop-in replacement for InnoDB that performs a lot better in many scenarios; in addition, Percona XtraDB Cluster also includes the Galera plugin, so you are ready to configure a new MySQL cluster in a very short time. You can find instructions on how to set up a MySQL cluster on Percona’s website as well, but here I’d like to add a few slightly different instructions on how to use packages you can download rather than the repositories for your Linux distribution provided by Percona. The reason I prefer these packages is that in a couple of cases I have noticed that the packages available for download are newer than those you’d install from the repositories. I will also be covering some firewalling and proxying configuration so as to have a secure and highly available MySQL cluster.

I will assume here you want to set up a complete MySQL cluster from scratch; you can skip some steps as you wish if that’s not your case. I will also assume you already have Linux boxes with at least the basic OS up to date; the following instructions will work as they are with Debian-based distros (I normally use Ubuntu).

SSH setup

First things first, let’s lock down each node by configuring SSH authentication and the firewall. We need to configure public key authentication and disable the weaker password-based authentication. From your client computer, copy your public key to your new server; there are various ways to do this but perhaps the easiest is with the utility ssh-copy-id, already available with most distros (if you are on OSX and use Homebrew, you can install it with brew install ssh-copy-id). Assuming your first node is called something like node1:

ssh-copy-id -i ~/.ssh/id_dsa.pub node1

Now test the public key authentication by SSH’ing into the box; you shouldn’t be asked for your password this time. Next, if all looks good, edit /etc/ssh/sshd_config and change the port number defined at the top of the file to the port you want to use; then uncomment the line with the setting PasswordAuthentication yes and change that setting to no so as to force authentication with the public key, which is more secure.
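
As a rough sketch, the relevant lines in /etc/ssh/sshd_config would end up looking something like this (2222 here is just an example port, not a recommendation):

Port 2222
PasswordAuthentication no

Now restart SSH with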

service ssh restart

making sure you don’t close your current terminal session until you have successfully tested the new configuration. Next, from your client computer, edit ~/.ssh/config and paste the following:

Host node1 # or whatever
  HostName … # ip or hostname of the box
  User … # user account you'll be using on the box
  Port … # custom port

Replace the placeholder text with the actual IP of the server, the username you’ll be using on the box and the SSH port you’ve chosen earlier; I recommend using a different port rather than the default one (22). Now try again to SSH into the box with just

ssh node1

You should be logged in if all went OK.

Firewall with iptables

For now, we’ll lock down the box with a restrictive iptables configuration; later we’ll open the ports required for the MySQL cluster to function. Edit /etc/iptables.rules and paste the following:

# Generated by iptables-save v1.4.4 on Tue Feb 19 23:11:06 2013
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -i lo -j ACCEPT
-A INPUT -d 127.0.0.0/8 ! -i lo -j DROP
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p tcp -m state --state NEW -m tcp --dport <your SSH port> -j ACCEPT
-A INPUT -p icmp -m icmp --icmp-type 8 -j ACCEPT
-A INPUT -j LOG
-A INPUT -j REJECT --reject-with icmp-port-unreachable
-A FORWARD -j REJECT --reject-with icmp-port-unreachable
-A OUTPUT -j ACCEPT
COMMIT
# Completed on Tue Feb 19 23:11:06 2013

This is the basic configuration I usually start with; I then open ports or make changes as required. To apply these firewall rules right away, run

iptables-restore < /etc/iptables.rules

To ensure these rules are also applied each time the server starts, edit /etc/network/interfaces and add post-up iptables-restore < /etc/iptables.rules:

auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
address ...
netmask ...
broadcast ...
network ...
gateway ...
post-up iptables-restore < /etc/iptables.rules

Of course make sure you specify your correct network settings here.

Dependencies

I install the same dependencies required when installing Percona XtraDB MySQL Cluster from source code, so as to be sure nothing is missed.

apt-get install curl build-essential flex bison automake autoconf \
bzr libtool cmake libaio-dev libncurses-dev zlib1g-dev libdbi-perl \
libnet-daemon-perl libplrpc-perl libaio1

There’s one more dependency you need, but it is only available from Percona’s repositories, so we need to add them to apt’s sources:

gpg --keyserver hkp://keys.gnupg.net --recv-keys 1C4CBDCDCD2EFD2A
gpg -a --export CD2EFD2A | sudo apt-key add -

Then edit /etc/apt/sources.list and append the following lines to enable these repositories:

deb http://repo.percona.com/apt lucid main
deb-src http://repo.percona.com/apt lucid main

Lastly, to install the remaining dependency:

apt-get update
apt-get install libmysqlclient18

Installing the Percona packages

It’s time now to install the Percona packages; you’ll need to install both the packages for the Percona server and those for Xtrabackup, a hot backup tool also from Percona. You will need Xtrabackup if you choose it as the strategy for provisioning nodes in the MySQL cluster – more on this later.

You can download the packages to install the Percona server from here and the one required to install Xtrabackup from here. At the time of this writing, the latest versions available are 5.5.29-23.7.2-389 for Percona server and 2.0.5-499 for Xtrabackup. I am using Ubuntu Lucid x86_64, so in the following example I am downloading the packages for this version:

cd /usr/local/src

wget http://www.percona.com/redir/downloads/Percona-XtraDB-Cluster/LATEST/deb/lucid/x86_64/percona-xtradb-cluster-common-5.5_5.5.29-23.7.2-389.lucid_all.deb
wget http://www.percona.com/redir/downloads/Percona-XtraDB-Cluster/LATEST/deb/lucid/x86_64/percona-xtradb-cluster-client-5.5_5.5.29-23.7.2-389.lucid_amd64.deb
wget http://www.percona.com/redir/downloads/XtraBackup/LATEST/deb/lucid/x86_64/percona-xtrabackup_2.0.5-499.lucid_amd64.deb
wget http://www.percona.com/redir/downloads/Percona-XtraDB-Cluster/LATEST/deb/lucid/x86_64/percona-xtradb-cluster-galera-2.x_143.lucid_amd64.deb
wget http://www.percona.com/redir/downloads/Percona-XtraDB-Cluster/LATEST/deb/lucid/x86_64/percona-xtradb-cluster-server-5.5_5.5.29-23.7.2-389.lucid_amd64.deb

Then, install these packages and stop the MySQL/Percona server since we need to configure the server as the first node of the cluster.

dpkg -i percona*.deb
service mysql stop

MySQL configuration

Next, edit the MySQL configuration at /etc/mysql/my.cnf and paste the content of this gist (the gist is unfortunately no longer available, but see the illustrative sketch after the notes below), which already includes the required configuration for the MySQL cluster nodes. An important note is in order here: the configuration in that gist is what I am currently using with a small MySQL cluster in which each node has 8 GB of RAM, so you may want to tweak some settings depending on your case. I have included them as they have worked pretty well for me. You could just include the settings in the Galera synchronous replication section and you’d be fine as far as the configuration of the MySQL cluster is concerned. So it’s up to you if you want to try the other settings too.

Notes on some of the settings in the gist:

  • max-connections: this setting really depends on many things. I’ve set it to 500 but the correct value depends on how you will be using MySQL;
  • transaction-isolation: MySQL’s default setting is REPEATABLE-READ which isn’t optimal; I prefer READ-COMMITTED (which happens to be the default setting in PostgreSQL instead);
  • skip-name-resolve: prevents the server from performing a DNS lookup each time a client connects to it, speeding up connections a little bit;
  • innodb_support_xa: this is required by the Galera replication;
  • innodb_import_table_from_xtrabackup: it allows restores of single tables by replacing the tablespace even at runtime, which can be pretty handy when you don’t need to restore the whole database;
  • innodb_log_file_size: I can’t remember exactly how I determined the value of 50M but the important thing to keep in mind concerning this setting is that you won’t be able to use a datadir with InnoDB logs created with a different value (so you’d have to delete the existing logs and restart MySQL if you change the value);
  • innodb_file_per_table: this is a recommended setting for two reasons: it uses disk space better by storing the data in separate files for the various tables vs a single gigantic file that can become bloated over time; it also allows for restores of single tables together with the previous setting;

As for the Galera synchronous replication section, you should basically use those settings as they are apart from:

  • server-id: this has to be a unique id for each node; you can choose any arbitrary value;
  • wsrep_cluster_name: of course this is the name you want to give to the MySQL cluster; it’s important that all nodes in the cluster have the same value;
  • wsrep_node_name: this as well should be different for each node; I usually use names such as db1,db2,…,dbN or node1,node2,…,nodeN;
  • wsrep_slave_threads: the recommended setting is 4 threads per CPU core;
  • wsrep_cluster_address: this very important setting determines the role of a node in the MySQL cluster; as we’ll see later, this should be set to gcomm:// on the first node when bootstrapping a new cluster. Once the cluster is ready and all the nodes have been configured, it is convenient to set the value to gcomm://db1,db2,…,dbN on each node instead; this way a node, when restarted or rebuilt, will automatically try one node at a time from the list to find one that is available and ‘synced’, so that node can become its ‘donor’ when it joins or rejoins the cluster;
  • wsrep_sst_method: this determines the synchronisation strategy to use when a node joins or rejoins the MySQL cluster after being offline for maintenance or for any other reason; at the moment I tend to use the rsync strategy as it seems to be somewhat more stable, but another good option is Percona’s own xtrabackup; the main difference is that with the rsync strategy both joiner and donor are seen as unavailable during the transfer of data, while with xtrabackup the donor is supposed to remain available. I haven’t tried this yet though.
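
Since the gist with my full configuration is no longer available, here is a minimal, illustrative sketch of what the Galera section (plus some of the settings discussed above) might look like; node names, paths and values are examples to adapt to your environment, not recommendations:

[mysqld]
# General settings discussed above (example values)
max-connections                     = 500
transaction-isolation               = READ-COMMITTED
skip-name-resolve
innodb_support_xa                   = 1
innodb_import_table_from_xtrabackup = 1
innodb_log_file_size                = 50M
innodb_file_per_table               = 1

# Galera synchronous replication
binlog_format                       = ROW               # required by Galera
default_storage_engine              = InnoDB
innodb_autoinc_lock_mode            = 2                 # required by Galera
wsrep_provider                      = /usr/lib/libgalera_smm.so
wsrep_cluster_name                  = my_cluster        # same on every node
wsrep_node_name                     = db1               # unique per node
server-id                           = 1                 # unique per node
wsrep_slave_threads                 = 4                 # ~4 per CPU core
wsrep_sst_method                    = rsync             # or xtrabackup
wsrep_cluster_address               = gcomm://          # bootstrap value for the first node; later gcomm://db1,db2,db3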

So go ahead with bootstrapping the MySQL cluster with the first node you’ve just set up, by setting wsrep_cluster_address to gcomm://. Then restart MySQL, which should now apply all the settings in the /etc/mysql/my.cnf configuration file. Before restarting though, if you have set innodb_log_file_size to some custom value, you’ll need to delete the existing InnoDB log files otherwise MySQL won’t start. MySQL’s datadir is by default /var/lib/mysql so to delete the old log files you can run:

rm /var/lib/mysql/ib_logfile*
service mysql restart

If MySQL fails to restart, try starting it “manually” with

mysqld -u mysql

which will show information that may be useful for debugging the problem. Otherwise, the first node is now ready and you can go ahead with adding one node at a time – at least two more nodes for an optimal configuration.
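
Before adding more nodes, you can also verify that the bootstrapped node reports a healthy single-node cluster; for example, something like:

mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size';"           # should report 1 at this stage
mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"    # should report Synced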

Adding nodes to the MySQL cluster

Adding more nodes to the cluster is almost an identical process to that required to set up the very first node, apart from a few key differences. In the MySQL configuration for each new node, make sure that

  • server-id and wsrep_node_name have different and unique values, i.e. not in use by any other nodes in the MySQL cluster;
  • wsrep_cluster_address: it should be set to the address of the first node, or in any case of a node that is already synced and available to act as donor, so that the joiner can receive data from it (see the example below).
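
For instance, following the sketch of my.cnf shown earlier, the lines that would change on a hypothetical second node are something like:

server-id             = 2
wsrep_node_name       = db2
wsrep_cluster_address = gcomm://node1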

Having updated the MySQL configuration, stop MySQL for now on the joiner node(s), and update the firewall configuration on all nodes so that they can exchange data with each other. I usually prefer using hostnames or aliases rather than IP addresses in iptables’ configuration, since it’s easier to see at a glance what each rule is for. So open /etc/hosts on each node and add entries for the IPs in use by the other nodes. For example, if I am on node1 in a 3-node MySQL cluster, I’d change the /etc/hosts file so it looks something like

127.0.0.1 localhost
127.0.1.1 node1.localdomain node1

# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

10.211.55.7 node2
10.211.55.8 node3

The two lines at the bottom are the important ones (of course make sure you specify the correct IP addresses). Next, we need to update the firewall rules. Open again /etc/iptables.rules and add the following rules before the -A INPUT -j REJECT rule:

-A INPUT -i eth0 -p tcp -m tcp --source node2 --dport 4567 -j ACCEPT
-A INPUT -i eth0 -p tcp -m tcp --source node2 --dport 4568 -j ACCEPT
-A INPUT -i eth0 -p tcp -m tcp --source node2 --dport 4444 -j ACCEPT
-A INPUT -i eth0 -p tcp -m tcp --source node2 --dport 3306 -j ACCEPT

Explanation: 4567 is the port another node will knock on to check whether this node is synced and available to become its donor; 4568 is used when an incremental state transfer (IST) is possible as opposed to a snapshot state transfer (SST), which basically is the copy of all of the data – whether IST is possible or not depends on how much the data on the joiner differs from the data on the donor; 4444 is the port used with the rsync strategy, while 3306 is of course the default port at which MySQL listens for clients.

In the example above, I was on node1 so I added rules for node2. It is important to have those four rules replicated for each of the other nodes in the MySQL cluster, so as to allow each node to exchange data with any of the other nodes. To apply the changes right away, run

iptables-restore < /etc/iptables.rules
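
For example, on node1 of a three-node cluster the added rules would cover both node2 and node3:

-A INPUT -i eth0 -p tcp -m tcp --source node2 --dport 4567 -j ACCEPT
-A INPUT -i eth0 -p tcp -m tcp --source node2 --dport 4568 -j ACCEPT
-A INPUT -i eth0 -p tcp -m tcp --source node2 --dport 4444 -j ACCEPT
-A INPUT -i eth0 -p tcp -m tcp --source node2 --dport 3306 -j ACCEPT
-A INPUT -i eth0 -p tcp -m tcp --source node3 --dport 4567 -j ACCEPT
-A INPUT -i eth0 -p tcp -m tcp --source node3 --dport 4568 -j ACCEPT
-A INPUT -i eth0 -p tcp -m tcp --source node3 --dport 4444 -j ACCEPT
-A INPUT -i eth0 -p tcp -m tcp --source node3 --dport 3306 -j ACCEPT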

Once this is done, you can start MySQL on one joiner at a time and it will start receiving data from the donor you have specified in my.cnf. Once all the nodes are up and running and synced, I recommend you set wsrep_cluster_address to gcomm://node1,node2,..nodeN. This way you don’t have to change that setting each time you take a node offline and then online again for maintenance or other reasons, because the joiner will automatically find the first node in that list which is available to provide it with the data. If all went well, when you start a newly configured node you can see it become a joiner and receive data from the donor by watching the MySQL-related processes (e.g. you could monitor this with something like watch “ps waux | grep mysql”); on the joiner you would see something like:

root 5167 0.1 0.1 19396 1952 pts/2 S+ 12:04 0:00 /bin/bash /etc/init.d/mysql start
root 5195 0.1 0.0 4108 728 pts/2 S+ 12:04 0:00 /bin/sh /usr/bin/mysqld_safe
mysql 5837 0.5 3.3 245612 33980 pts/2 Sl+ 12:04 0:00 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --user=mysql --log-error=/var/lib/mysql/node3.err --pid-file=/var/lib/mysql/node3
mysql 5884 0.0 0.0 4108 612 pts/2 S+ 12:04 0:00 sh -c wsrep_sst_rsync --role 'joiner' --address '10.211.55.8' --auth '' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --parent '5837'
mysql 5886 0.2 0.1 19248 1764 pts/2 S+ 12:04 0:00 /bin/bash -ue /usr//bin/wsrep_sst_rsync --role joiner --address 10.211.55.8 --auth --datadir /var/lib/mysql/ --defaults-file /etc/mysql/my.cnf --parent 5837
mysql 5909 0.0 0.0 10984 676 ? Ss 12:04 0:00 rsync --daemon --port 4444 --config /var/lib/mysql//rsync_sst.conf

In the example above I was using the rsync strategy; the output would look slightly different if you used the xtrabackup strategy instead. This is instead what you would see on the donor while the SST is happening:

root 746 0.0 0.0 4108 688 ? S 11:38 0:00 /bin/sh /usr/bin/mysqld_safe
mysql 1448 0.1 10.6 1118380 108624 ? Sl 11:38 0:03 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --user=mysql --log-error=/var/lib/mysql/node2.err --pid-file=/var/lib/mysql/node2
mysql 6938 0.0 0.0 4108 616 ? S 12:22 0:00 sh -c wsrep_sst_rsync --role 'donor' --address '10.211.55.8:4444/rsync_sst' --auth '(null)' --socket '/var/run/mysqld/mysqld.sock' --datadir '/var/lib/mysql/' --defaults-fil
mysql 6939 1.0 0.1 17732 1592 ? S 12:22 0:00 /bin/bash -ue /usr//bin/wsrep_sst_rsync --role donor --address 10.211.55.8:4444/rsync_sst --auth (null) --socket /var/run/mysqld/mysqld.sock --datadir /var/lib/mysql/ --defa
mysql 6949 33.0 0.1 21112 1636 ? R 12:22 0:00 rsync --archive --no-times --ignore-times --inplace --delete --quiet --whole-file -f + /ib_lru_dump -f + /ibdata* -f + /ib_logfile* -f + */ -f -! */* /var/lib/mysql/ rsync:/

Once you have configured all the nodes, your shiny new MySQL cluster is ready to be used as it is (yay!), but in order to take full advantage of it you will need to split reads and writes either in your application or with a load balancer such as haproxy, which I cover next.
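
At this point a quick sanity check on any node is to look at what Galera reports about the cluster; for example, on a three-node cluster you would expect something like:

mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size';"          # should be 3 with all three nodes up
mysql -e "SHOW STATUS LIKE 'wsrep_incoming_addresses';"    # should list the addresses of all the nodes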

Load balancing and failover

Once you have two or (better) more nodes in the MySQL cluster, you could already use it as is and split connections or reads/writes at application level; however it’s perhaps easiest to use something like haproxy that will handle this for you and will also ensure nodes that are not in sync are ignored. Setting this up is quite easy; first, unless you have haproxy already installed, you can install it with

apt-get install haproxy

Next, edit /etc/haproxy/haproxy.cfg and paste the following lines:

listen mysql-cluster 0.0.0.0:3306
mode tcp
balance leastconn
option tcpka
option httpchk

server db1 node1:3306 check port 9200 inter 5000 rise 3 fall 3 maxconn 400
server db2 node2:3306 check port 9200 inter 5000 rise 3 fall 3 maxconn 400
server db3 node3:3306 check port 9200 inter 5000 rise 3 fall 3 maxconn 400

Ensure you have listed all the nodes; do not restart haproxy yet. First, we need to configure a service on each node that haproxy will use to monitor the nodes and automatically ignore nodes that are offline or not in sync with the rest of the MySQL cluster. This is typically done with xinetd, although there are certainly other ways to achieve the same result. If you don’t have xinetd installed yet, run apt-get install xinetd, then create the new file /etc/xinetd.d/mysqlchk, if it doesn’t exist yet (it appears that this configuration is now done automatically in the newest version of Percona MySQL cluster) and paste the following:

# default: on
# description: mysqlchk
service mysqlchk
{
  # this is a config for xinetd, place it in /etc/xinetd.d/
  disable = no
  flags = REUSE
  socket_type = stream
  port = 9200
  wait = no
  user = nobody
  server = /usr/bin/clustercheck
  log_on_failure += USERID
  only_from = 0.0.0.0/0
  # recommended to put the IPs that need
  # to connect exclusively (security purposes)
  per_source = UNLIMITED
}
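
Note that, depending on the distribution, you may also need to declare the mysqlchk service in /etc/services if it isn’t already defined there, otherwise xinetd may refuse to start it; the entry would look something like:

mysqlchk        9200/tcp        # MySQL cluster check for haproxy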

Next, edit /usr/bin/clustercheck, which may or may not exist depending on the version of the MySQL cluster you have set up. If the file exists, just ensure that the variables MYSQL_USERNAME and MYSQL_PASSWORD are set to the correct MySQL credentials. If the file doesn’t exist yet, create it, make it executable (chmod +x /usr/bin/clustercheck) and paste the following:

#!/bin/bash

MYSQL_USERNAME="..."
MYSQL_PASSWORD="..."
ERR_FILE="/dev/null"
AVAILABLE_WHEN_DONOR=0

#
# Perform the query to check the wsrep_local_state
#
WSREP_STATUS=`mysql --user=${MYSQL_USERNAME} --password=${MYSQL_PASSWORD} -e "SHOW STATUS LIKE 'wsrep_local_state';" 2>${ERR_FILE} | awk '{if (NR!=1){print $2}}' 2>${ERR_FILE}`

if [[ "${WSREP_STATUS}" == "4" ]] || [[ "${WSREP_STATUS}" == "2" && ${AVAILABLE_WHEN_DONOR} == 1 ]]
then
# Percona XtraDB Cluster node local state is 'Synced' => return HTTP 200
/bin/echo -en "HTTP/1.1 200 OK\r\n"
/bin/echo -en "Content-Type: text/plain\r\n"
/bin/echo -en "\r\n"
/bin/echo -en "Percona XtraDB Cluster Node is synced.\r\n"
/bin/echo -en "\r\n"
exit 0
else
# Percona XtraDB Cluster node local state is not 'Synced' => return HTTP 503
/bin/echo -en "HTTP/1.1 503 Service Unavailable\r\n"
/bin/echo -en "Content-Type: text/plain\r\n"
/bin/echo -en "\r\n"
/bin/echo -en "Percona XtraDB Cluster Node is not synced.\r\n"
/bin/echo -en "\r\n"
exit 1
fi
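
Also remember that the check needs a MySQL account matching the credentials you set in the script; any user able to run SHOW STATUS is enough, so a minimal (hypothetical) example would be:

mysql -u root -p -e "GRANT USAGE ON *.* TO 'clustercheck'@'localhost' IDENTIFIED BY 'some_password';"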

If you run /usr/bin/clustercheck manually on an active, synced node, you’ll see the following output as expected:

HTTP/1.1 200 OK
Content-Type: text/plain

Percona XtraDB Cluster Node is synced.

Now restart xinetd with /etc/init.d/xinetd restart and then test that the script can also be run via the port specified in the xinetd configuration (9200):

root@node1:~# telnet localhost 9200
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
HTTP/1.1 200 OK
Content-Type: text/plain

Percona XtraDB Cluster Node is synced.

Connection closed by foreign host.

Now you can reload haproxy as well with

service haproxy reload

and ensure your applications connect to the load balancer instead of connecting to any of the MySQL cluster’s nodes directly. One last thing I’d like to suggest, which I find very useful, is to use haproxy’s web interface to check the status of the nodes, especially when you take one node offline for maintenance and want to check that it rejoins the cluster correctly when that’s done. Edit /etc/haproxy/haproxy.cfg again and add the following lines (ensure you use a good combination of username and password, and optionally use a custom port):

listen stats 0.0.0.0:8282
mode http
option httpchk
balance roundrobin
stats uri /
stats refresh 10s
stats realm Haproxy\ Statistics
stats auth username:password

Once you reload haproxy again, you will be able to see the status of the MySQL cluster’s nodes from the web UI at the port specified (8282 or whichever you have chosen).
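
If you prefer the command line, the same information is also available in CSV form; for example something like this (the field positions may vary between haproxy versions):

curl -s -u username:password 'http://localhost:8282/;csv' | cut -d, -f1,2,18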

Testing

Testing the MySQL cluster is quite easy: just take a node offline, or kill -9 MySQL and delete all the data on a node, and see what happens when you restart MySQL :p
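
For example, a deliberately destructive test on one node could look like the sketch below; the node should rejoin the cluster and receive a full state transfer from a donor:

# on one node only - this wipes its local data on purpose
service mysql stop
rm -rf /var/lib/mysql/*
service mysql start    # then watch the node rejoin and resync from a donor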

Conclusion

I think that even though both Galera and Percona XtraDB Cluster are relatively new, this combination is definitely the best setup I have worked with so far for MySQL databases; it’s nice to have the peace of mind that nodes can be taken offline at any time for maintenance and have them resynced automatically without downtime, and at the same time to scale reads and, to some extent, writes too. I will certainly play again with alternatives such as MongoDB and similar, but I have been using a MySQL cluster with Percona/Galera in production for a while now and it’s been pretty stable and has required very little maintenance, so for the time being I will stick with MySQL rather than rethinking the applications I am working on so as to adapt them to other solutions. I will, however, very likely look into similar clustering solutions for PostgreSQL since I am getting more and more interested in that database these days.

By the way, I also published another post on how to perform MySQL hot backups with Percona’s Xtrabackup; you may want to take a look.

I would be interested to hear others’ experiences with a MySQL cluster with Percona/Galera or any alternatives that have worked well for them.

© Vito Botta