Getting started with Apache Cassandra

Introduction

Scaling a database, regardless of the technology behind it, is always a challenge. This is particularly true of traditional RDBMSs such as MySQL and PostgreSQL, which power most applications out there. With “standard” replication it is possible to scale reads but not writes, since in most configurations there’s just a single master and many slaves. Then there is the issue of high availability: a single master means a single point of failure. Replication lag can also be a problem in many cases.

For MySQL there are solutions based, for example, on Galera replication (in the past I’ve often used Percona’s XtraDB Cluster with success) which remove the replication lag issue and make it possible to scale reads and, to some extent, writes, while also providing a highly available system (I’m sure there are similar solutions for other RDBMSs). Unfortunately, a system like this requires each node to store a copy of all of the data, making it impractical to manage huge volumes of data however beefy the servers are. Techniques such as sharding can help with this, but can be a nightmare depending on the upfront design.

Over the past few years, new technologies commonly referred to as “NoSQL” (which stands for “Not only SQL” or “Non relational” depending on who you ask) have become very popular as they address some or all the aforementioned issues with relational databases and are a better fit when scalability and high availability of huge volumes of data are paramount.

One of the best NoSQL database systems currently available is Apache Cassandra, which was open sourced by Facebook in 2008 and became an Apache project in 2010. At the time of this writing the latest version is 3.10.

Based on Amazon Dynamo for the distributed architecture and Google’s BigTable for the data model, Cassandra is an open source distributed, masterless, highly available (no SPOF), fault tolerant and very fast NoSQL database system written in Java that runs on commodity servers and offers near linear scalability; it can handle very large volumes of data across lots of nodes even in different data centers.

Sounds great, right? It does have some limitations, though, such as the lack of joins and limited support for aggregations in CQL (Cassandra Query Language, a SQL-like language which we’ll look into later); however, these limitations are by design: they force you to denormalize data so that most queries can be executed efficiently on a single node rather than across multiple nodes or the entire cluster, which wouldn’t be efficient. Other limitations include per-partition ordering defined when creating a table, and the fact that the data for a partition must fit on a single node (so it’s important to design tables so that partitions don’t grow indefinitely). The key here is not to treat Cassandra as a drop-in replacement for an RDBMS when designing applications.

Besides Facebook, many companies such as Netflix and Apple use Cassandra to manage massive amounts of data distributed across thousands of nodes in multiple data centers. Cassandra is available in various flavours, including a “standard” open source version and those offered by DataStax with additional features. In this post, we’ll see how to set up and use a Cassandra cluster using the standard open source version. We’ll use Ubuntu 16.04 LTS as the Linux distro for the nodes of the cluster, so many steps will be different if you use another distro.

Setting up the nodes

To play with Cassandra, I suggest you either use virtualisation software on your computer (hereinafter the “host”) such as the free VirtualBox, or use a cloud VPS provider such as Amazon AWS, Digital Ocean, or others – which will cost money though. If you go the first route – which I recommend for simplicity – make sure you configure your VMs with “bridged” virtual network adapters, otherwise you may have problems with networking between the nodes.

I will assume here that you use virtualisation software. To get started, run ifconfig on the host and take note of your IP address; we’ll configure the Cassandra nodes to use static IPs in the same network so that the host can easily SSH into each node. In my case, for example, the IP of the host is 192.168.10.56.

To set up the first node, create a VM configured with as many cores as the number of cores on your computer and 1GB or more of RAM, depending on how much RAM is installed on the host.

Start this first VM, then log in and, as root, edit the file /etc/network/interfaces with Vim or another editor to replace the line

iface enp0s3 inet dhcp

with these lines (taken from a node of my test cluster as an example):

iface enp0s3 inet static
  address 192.168.10.101
  gateway 192.168.10.1
  netmask 255.255.255.0
  dns-nameservers 8.8.8.8

Make sure you specify the correct name for the network interface in your case, and the correct IP and gateway. In my case, the gateway is 192.168.10.1 and I’m gonna use IPs from 192.168.10.101 to 192.168.10.104 for the 4 nodes in the test cluster (we’ll set up 4 nodes to observe how Cassandra behaves depending on the replication factor).

Next, edit /etc/hostname and change the hostname to something like node1; I’m gonna name my nodes node1 to node4. Once that’s done, edit /etc/hosts and add the following, so that the node already has some configuration to reach the other nodes later:

192.168.10.101    node1
192.168.10.102    node2
192.168.10.103    node3
192.168.10.104    node4

Reboot with sudo reboot for the changes to take effect.

To more easily work on the VM from a terminal on the host, edit ~/.ssh/config on the host and add

Host node1
  Hostname node1
  User user

Do the same for node2, node3 and node4.

Also, still on the host, edit /etc/hosts and add:

192.168.10.101 node1
192.168.10.102 node2
192.168.10.103 node3
192.168.10.104 node4

Then run ssh-copy-id node1 to copy your SSH key (assuming you have one) into the VM, then run ssh node1 and log in.

Now that you are in the VM from a terminal on the host, you need to install Java as it’s Cassandra’s main dependency.
Run the following commands:

sudo add-apt-repository ppa:webupd8team/java
sudo apt update
sudo apt install oracle-java8-installer

To check if Java is correctly installed run

java -version

Next, run the following to install Cassandra:

echo "deb http://www.apache.org/dist/cassandra/debian 36x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
curl https://www.apache.org/dist/cassandra/KEYS | sudo apt-key add -
sudo apt update
sudo apt install cassandra

Cassandra should now be up and running as a single node – you can check with

ps waux | grep cassandra

Next, stop Cassandra with

sudo service cassandra stop

Then edit /etc/cassandra/cassandra.yaml and change the following (the end result is sketched in the snippet below):
– set cluster_name to whatever you like, or leave it as the default “Test Cluster” (just make sure you use the same cluster name on all of the nodes);
– search for seed_provider and under seeds set the IP of this first node, 192.168.10.101; this means the first node will use itself as its own seed, another way of saying that it will start as a single-node cluster;
– change listen_address and rpc_address to 192.168.10.101; the former is the address the other nodes will connect to in order to exchange information with this node through “gossip”, while the latter is the address clients will connect to in order to execute queries.
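
Just for reference, here is roughly what the relevant parts of cassandra.yaml end up looking like on this first node (everything else can stay at its defaults):

cluster_name: 'Test Cluster'

seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "192.168.10.101"

listen_address: 192.168.10.101
rpc_address: 192.168.10.101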

Now restart Cassandra with:

sudo service cassandra start

You can run nodetool status to verify that the single node cluster is up and running. You’ll see something like

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.10.101 2.78 MiB 256 100% 2611d399-71f6-4696-90f1-2cc390f1079e rack1

Where “UN” stands for “UP and Normal” (Normal meaning the node is fully operational).

To set up the other nodes, repeat all the steps so far, but make sure you set, on each node (see the example below):
– cluster_name to the same name you specified on the first node;
– listen_address and rpc_address to the IP address of the new node;
– seeds to the IPs of the other nodes.
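
For example, on node2 the relevant settings would end up looking something like this:

cluster_name: 'Test Cluster'

seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "192.168.10.101,192.168.10.103,192.168.10.104"

listen_address: 192.168.10.102
rpc_address: 192.168.10.102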

If you run nodetool status again from any node, you’ll see something like the following

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.10.104 2.72 MiB 256 46.9% 6bace8f9-469b-439b-9f70-4c9ceb1848ed rack1
UN 192.168.10.101 209.79 KiB 256 48.8% 2611d399-71f6-4696-90f1-2cc390f1079e rack1
UN 192.168.10.102 2.22 MiB 256 51.5% cf555392-5d96-4ad4-81bb-60c70c397fdf rack1
UN 192.168.10.103 225.34 KiB 256 52.8% 6627812b-0400-4a32-aaee-c1bdd4daff1d rack1

This shows that all the 4 nodes are up and running in the same cluster.

Connecting to the cluster with a client

CQL, or Cassandra Query Language, is a SQL-like language that can be used to communicate with Cassandra from clients and execute queries. You can experiment with CQL using the cqlsh shell provided with Cassandra upon installation. In order for it to work, you must install python and the driver for Cassandra:

sudo apt install python python-pip
sudo pip install cassandra-driver

The last step might take some time. Now you should be able to run cqlsh from any node with:

CQLSH_NO_BUNDLED=true CQLSH_HOST=192.168.10.102 cqlsh

You can specify the IP address of any node; it doesn’t really matter which. You’ll see a prompt similar to this:

Connected to Test Cluster at 192.168.10.101:9042.
[cqlsh 5.0.1 | Cassandra 3.6 | CQL spec 3.4.2 | Native protocol v4]
Use HELP for help.
cqlsh>

For a full reference on CQL see https://cassandra.apache.org/doc/latest/cql/index.html. Here I’ll show a few basic commands to get started.

From the CQL prompt, run:

create keyspace testdb WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

Here we are creating a keyspace – that is, a database in Cassandra parlance – named testdb. We are specifying SimpleStrategy, which simply places replicas around the cluster without taking data centers or racks into account; in a production environment you may want to use NetworkTopologyStrategy instead, since it allows you to specify a different replication factor for each data center. The replication factor is simply the number of nodes which will hold replicas of the data. It is also possible to change the replication factor after creating the keyspace, as we’ll see later.
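
For example, with NetworkTopologyStrategy the replication factor is given per data center; using the default data center name we saw earlier in the nodetool status output (datacenter1), plus a second, purely hypothetical data center, it would look something like this:

create keyspace testdb_multidc WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': 3, 'datacenter2': 2};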

Now, let’s create a sample table which will contain names, after switching to the newly created keyspace:

use testdb;
create table names (first_name text, last_name text, PRIMARY KEY (last_name, first_name));

We can then insert data as we would do in a typical SQL database:

insert into names (first_name, last_name) values ('Vito', 'Botta');

Note: when creating the table we specified a composite primary key on the last_name and first_name columns; the first column specified is also the partition key, meaning that the rows having the same values for that column will be stored together in the same partition. As mentioned earlier, it is important that tables are designed in such a way that partitions don’t grow indefinitely because each partition must fit entirely on each node that contains a replica of it; for example you may partition by date. The second column, first_name, will determine the ordering of the data. So we’ll basically have names with the same last name stored together and sorted by first_name.
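
As a quick illustration of partitioning by date, here is a hypothetical time series table (the table and column names are made up for this example): each (sensor_id, event_date) pair gets its own partition, and the rows within a partition are ordered by time.

create table readings (
  sensor_id   text,
  event_date  date,
  recorded_at timestamp,
  reading     double,
  PRIMARY KEY ((sensor_id, event_date), recorded_at)
);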

To see this ordering in action, let’s load some sample data. I should mention that test data can also be generated with the cassandra-stress tool, but here we’ll use a simple Ruby script to generate some random names to play with. Assuming Ruby is already installed, run

gem install cassandra-driver
gem install faker

to install the gems required for this example. cassandra-driver is the Ruby driver for Cassandra, while faker is a library that generates realistic-looking sample data; in this case we’ll use it to generate first and last names. Open an editor and create a file named e.g. cassandra-test.rb with the following content:

require "cassandra"
require "faker"

cluster = Cassandra.cluster(hosts: ["192.168.10.101", "192.168.10.102", "192.168.10.103", "192.168.10.104"])
keyspace = 'testdb'
session = cluster.connect(keyspace)
statement = session.prepare('INSERT INTO names (first_name, last_name) VALUES (?, ?)')

1_000.times do
  batch = session.batch do |batch|
    1_000.times do
      batch.add(statement, arguments: [Faker::Name.first_name, Faker::Name.last_name])
    end
  end
  session.execute(batch)
end

In this sample script, we are connecting to the cluster by specifying the IPs of its nodes (you don’t have to specify all of them, as the driver will automatically discover the missing nodes and use this information for automatic load balancing), switching to the testdb keyspace, creating a “prepared statement”, then generating and executing 1000 batches of 1000 inserts each, with first names and last names generated by Faker. This way we create approximately 1M rows; I say “approximately” because Faker may generate duplicate combinations of first name and last name, and inserting a row with an existing primary key simply overwrites it, so the total number of rows created will likely be close to 1M but not exactly 1M. The reason I run 1000 batches of 1000 inserts each, rather than one huge batch, is that there is a limit on the size of a batch that can be executed at once.

Run the script with:

ruby cassandra-test.rb

Then, to check that the data has been generated, go back to cqlsh on any node and run

select * from names limit 10;

You’ll see something like the following:

 last_name | first_name
-----------+-------------
    Jacobi |    Abelardo
    Jacobi |      Adolph
    Jacobi |    Agustina
    Jacobi |      Aileen
    Jacobi |      Alayna
    Jacobi |     Alberto
    Jacobi |  Alexandrea
    Jacobi | Alexandrine
    Jacobi |      Alycia
    Jacobi |      Amalia

Now run nodetool status on any node and see how the data is distributed across the nodes (“Owns”) – remember that we created the keyspace specifying a replication factor of 1. You’ll see something like this:

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.10.104 2.11 MiB 256 24.8% 6bace8f9-469b-439b-9f70-4c9ceb1848ed rack1
UN 192.168.10.101 203.69 KiB 256 24.6% 2611d399-71f6-4696-90f1-2cc390f1079e rack1
UN 192.168.10.102 294.43 KiB 256 26.1% cf555392-5d96-4ad4-81bb-60c70c397fdf rack1
UN 192.168.10.103 232.2 KiB 256 24.5% 6627812b-0400-4a32-aaee-c1bdd4daff1d rack1

Note that Owns is ~25% on each node. With a replication factor of 1 the data is distributed but not replicated, so if we lose one node, for example, the cluster is still up and running but we lose 25% of the data.

Let’s change the replication factor to 3 with

alter keyspace testdb with replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

Run again nodetool status:

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.10.104 2.11 MiB 256 74.2% 6bace8f9-469b-439b-9f70-4c9ceb1848ed rack1
UN 192.168.10.101 208.7 KiB 256 74.9% 2611d399-71f6-4696-90f1-2cc390f1079e rack1
UN 192.168.10.102 294.43 KiB 256 73.0% cf555392-5d96-4ad4-81bb-60c70c397fdf rack1
UN 192.168.10.103 232.2 KiB 256 78.0% 6627812b-0400-4a32-aaee-c1bdd4daff1d rack1

As you can see, each node now stores about 75% of the data, because we have 4 nodes and a replication factor of 3.
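
Keep in mind that these ownership figures depend on the replication settings of the keyspace; if your keyspaces use different replication settings, you can ask nodetool for the numbers of a specific keyspace, for example:

nodetool status testdb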

If we set the replication factor to 4:

alter keyspace testdb with replication = {'class': 'SimpleStrategy', 'replication_factor': 4};

We’ll see that each node contains all the data (Owns = 100%):

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.10.104 2.12 MiB 256 100.0% 6bace8f9-469b-439b-9f70-4c9ceb1848ed rack1
UN 192.168.10.101 198.69 KiB 256 100.0% 2611d399-71f6-4696-90f1-2cc390f1079e rack1
UN 192.168.10.102 284.43 KiB 256 100.0% cf555392-5d96-4ad4-81bb-60c70c397fdf rack1
UN 192.168.10.103 237.21 KiB 256 100.0% 6627812b-0400-4a32-aaee-c1bdd4daff1d rack1

Cool, huh?

Monitoring the cluster

We’ve already seen how the nodetool status command lists all the nodes in the cluster with some basic information about their status. Another useful nodetool command is nodetool info, which shows information about a particular node. You can see all the available commands by running nodetool without any parameters.

There are other tools which can be used for monitoring and that, like nodetool, communicate with Cassandra through JMX (Java Management Extensions). For example, jconsole – available with the JDK – allows you to “look” inside a Java process and, in our case, to see lots of useful metrics for the Cassandra cluster. It requires a GUI though, so it won’t work out of the box on our server version of Ubuntu. However, you can run it from another machine (such as the host computer in our case) by connecting to a node on port 7199. To make it work with our test cluster, you’ll need to either enable password authentication or edit /etc/cassandra/cassandra-env.sh on a node and change the line

JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.authenticate=true"

to

JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.authenticate=false"

And change

LOCAL_JMX=yes

to

LOCAL_JMX=no

Then restart cassandra with sudo service cassandra restart. You should now be able to run jconsole on the host and connect to the remote process on the node at :7199.
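
For example, assuming you made these changes on node1 and that the host can resolve the node’s name (we added it to /etc/hosts earlier), you can point jconsole straight at it from the host:

jconsole node1:7199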

Another option is DataStax OpsCenter, a web app that lets you both monitor and manage a Cassandra cluster. It’s a great tool, but unfortunately the latest version (6.x) no longer supports the open source/community editions of Cassandra, so you’d have to use the DataStax Enterprise (DSE) distribution of Cassandra instead.

Repairing a node

When a node goes down and then comes back later, it may have out of date data. Assuming a replication factor greater than 1, when you bring the node back in the cluster you should “repair” it so that the node can be updated using another replica as seed. This can be done with the nodetool repair command:

nodetool -h **ip of the node to repair** repair

Optionally, it is possible to specify a keyspace to repair

nodetool -h **ip of the node to repair** repair **name of the keyspace**
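
For example, to repair only the testdb keyspace on the third node of our test cluster:

nodetool -h 192.168.10.103 repair testdb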

Removing a node / Adding a node back to the cluster

When a node requires maintenance or you want to reduce capacity, you can remove it from the cluster. If the removal is planned, you can use the

nodetool -h **ip of the node** decommission

command to decommission the node. If you run nodetool status from any node while the node is being decommissioned, you will see that the status for that node is UL, meaning that the node is still up but is leaving the cluster. Eventually, Cassandra will automatically redistribute the data stored on the node being decommissioned to the other nodes, and the decommissioned node will disappear altogether from the nodetool status list. Note that decommissioning a node does not remove the data from that node. Also note that if you are running this command from one node to decommission another node, you may need to configure authentication or disable it on the node being decommissioned (as explained earlier), so that the two nodes can communicate via JMX.

Because the data is not removed from a decommissioned node, before adding it back to the cluster (for example once maintenance is complete) you may want to “clear” that data if the node has been down for a while – this also makes adding the node back to the cluster quicker. To delete the data from the decommissioned node, run (on that node):

sudo service cassandra stop
cd /var/lib/cassandra
sudo rm -rf commitlog data saved_caches
sudo service cassandra start

The node will reappear in the nodetool status list as UJ (J stands for “joining”); Cassandra will redistribute the data once again including this node and eventually the status of this node will be UN.

When adding a node back to the cluster, you may want to “clean up” the data stored on the other nodes, since the data that was previously copied from the decommissioned node to those nodes remains there for a while. You can do this with the following command:

nodetool -h **ip of node** cleanup

From nodetool status you’ll see that the amount of data stored on the node (“Load“) is reduced.

We’ve seen how to decommission a node for maintenance or similar; if a node dies unexpectedly, for example due to hardware failure, you’d want to use the removenode command instead of decommission to remove it. First, take note of the ID of the node to remove from the nodetool status list, then run

nodetool removenode **ID of the node**
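
For example, to remove node3 using the Host ID shown in the earlier nodetool status output:

nodetool removenode 6627812b-0400-4a32-aaee-c1bdd4daff1d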

Adding the node back to the cluster once the problem has been fixed works the same as for a decommissioned node.


This was just an introduction to Cassandra; it’s a big topic and it’s impossible to cover everything in a single post, also because I am still learning it myself, so I may write more about it later as I learn.

MySQL: Generate column names dynamically from row values

Let’s say you have a table, or the results of a query, with some values by date in different rows. You want to generate column names dynamically from these dates and show the relevant results as values of these new columns rather than as values in different rows: basically, you want to transpose rows into columns and have dynamically generated column names at the same time. Unfortunately, MySQL doesn’t allow the use of functions to generate column names and, as far as I know, it doesn’t have an out-of-the-box way to generate column names dynamically in general (please let me know if I am mistaken; I’m keen to learn something new), but it is definitely possible with a trick using prepared statements. Let me show you with an example.

I have a table, named hcu_collection, which I use to collect some data for each of a number of software licenses. The relevant columns in this example are the license_id, the collect_date and an integer column named total_hv_cores (it’s just an example from a real app so ignore the meaning of this column as it’s not important); for example I want to know the MAX(total_hv_cores) by date for each license over the past 3 days. So I can use a simple query like the following:

SELECT license_id, collect_date, MAX(total_hv_cores) cores
FROM hcu_collection
WHERE collect_date >= SUBDATE(CURDATE(), 2)
GROUP BY license_id, collect_date
ORDER BY collect_date ASC, collect_hour ASC;

which produces these results:

+------------+--------------+-------+
| license_id | collect_date | cores |
+------------+--------------+-------+
|         18 | 2015-12-04   |  1108 |
|         67 | 2015-12-04   |   436 |
|        102 | 2015-12-04   |   140 |
...
...
|      12174 | 2015-12-10   |    78 |
|      12380 | 2015-12-10   |   624 |
...
...

What I want instead is a table that looks like the following, for example for the past 3 days:

+------------+-----------+-----------+-----------+
| license_id | Tue 08/12 | Wed 09/12 | Thu 10/12 |
+------------+-----------+-----------+-----------+
|          2 |       238 |       238 |       246 |
|          3 |        60 |        68 |        68 |
|          4 |        12 |        16 |        12 |
|          7 |       212 |       212 |       220 |
...
...

As I said, I am not aware of a built-in way in MySQL to achieve this, so what I have done is generate a query dynamically which, when executed, will in turn generate the column names from the dates as I want.

The first step is to create a temporary table with the results from the original query, for convenience, since we are going to need these results more than once in the query that will be generated dynamically.

DROP TABLE IF EXISTS tmp_results;

CREATE TEMPORARY TABLE tmp_results AS
SELECT license_id, collect_date, MAX(total_hv_cores) cores
FROM hcu_collection
WHERE collect_date >= SUBDATE(CURDATE(), 2)
GROUP BY license_id, collect_date
ORDER BY collect_date ASC, collect_hour ASC;

Next, we need to generate a new query dynamically. Here’s an example:

SELECT CONCAT('
SELECT license_id, ',cores_by_dates,'
FROM tmp_results
GROUP BY license_id
ORDER BY license_id'
)
INTO @query
FROM
(
SELECT GROUP_CONCAT(CONCAT('IFNULL(MAX(CASE WHEN collect_date=''',actual_date,''' THEN cores END), ''-'') AS "',col_name,'"')) cores_by_dates
FROM (
SELECT actual_date, DATE_FORMAT(actual_date,'%a %d/%m') AS col_name
FROM (SELECT DISTINCT collect_date AS actual_date FROM tmp_results) AS dates
) dates_with_col_names
) result;

The important bit is

SELECT GROUP_CONCAT(CONCAT('IFNULL(MAX(CASE WHEN collect_date=''',actual_date,''' THEN cores END), ''-'') AS "',col_name,'"')) cores_by_dates
FROM (
SELECT actual_date, DATE_FORMAT(actual_date,'%a %d/%m') AS col_name
FROM (SELECT DISTINCT collect_date AS actual_date FROM tmp_results) AS dates
) dates_with_col_names

which generates something like:

IFNULL(MAX(CASE WHEN collect_date='2015-12-08' THEN cores END), '-') AS "Tue 08/12",IFNULL(MAX(CASE WHEN collect_date='2015-12-09' THEN cores END), '-') AS "Wed 09/12",IFNULL(MAX(CASE WHEN collect_date='2015-12-10' THEN cores END), '-') AS "Thu 10/12"

We save this new query in @query so that we can use it to prepare a statement:

PREPARE statement FROM @query;

Last, we just need to execute it:

EXECUTE statement;

This shows the results I want, with the dates as column names. Don’t forget to deallocate the prepared statement after fetching the results:

DEALLOCATE PREPARE statement;

Note: depending on how many dates you use to generate the columns, you may exceed the limit allowed for GROUP_CONCAT‘s length (default is 1024 bytes). So you may need to add something like

SET SESSION group_concat_max_len = 1000000;

before the dynamic generation of the query.
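
For convenience, here is the whole sequence again in a single block, exactly as described above (same table and column names; adjust the date range to your needs):

-- allow a long GROUP_CONCAT result in case there are many dates/columns
SET SESSION group_concat_max_len = 1000000;

-- 1. materialise the per-date aggregates we want to pivot
DROP TABLE IF EXISTS tmp_results;

CREATE TEMPORARY TABLE tmp_results AS
SELECT license_id, collect_date, MAX(total_hv_cores) cores
FROM hcu_collection
WHERE collect_date >= SUBDATE(CURDATE(), 2)
GROUP BY license_id, collect_date
ORDER BY collect_date ASC, collect_hour ASC;

-- 2. build the pivot query, with one column per distinct date
SELECT CONCAT('
SELECT license_id, ', cores_by_dates, '
FROM tmp_results
GROUP BY license_id
ORDER BY license_id')
INTO @query
FROM (
  SELECT GROUP_CONCAT(CONCAT('IFNULL(MAX(CASE WHEN collect_date=''', actual_date, ''' THEN cores END), ''-'') AS "', col_name, '"')) cores_by_dates
  FROM (
    SELECT actual_date, DATE_FORMAT(actual_date, '%a %d/%m') AS col_name
    FROM (SELECT DISTINCT collect_date AS actual_date FROM tmp_results) AS dates
  ) dates_with_col_names
) result;

-- 3. run the generated query via a prepared statement
PREPARE statement FROM @query;
EXECUTE statement;
DEALLOCATE PREPARE statement;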

Hope it can be useful to someone.

Easier backups with duplicity and xtrabackup

A little while ago I wrote a couple of scripts to take backups with duplicity and xtrabackup more easily; I am a little allergic to all the options and arguments you can use with both duplicity and xtrabackup, so these scripts use simple configuration files instead.

You can find these scripts on Github at https://github.com/vitobotta/admin-scripts.

xtrabackup

Xtrabackup is a great tool for taking backups (both full and incremental) of your MySQL databases without bringing them offline. When you first launch the script – admin-scripts/backup/xtrabackup.sh – without arguments, it will generate a simple configuration file at ~/.xtrabackup.config, containing the following settings – you only need to set the MySQL credentials, customise the paths of source and destination, and choose how many backup chains to keep:

MYSQL_USER="..."
MYSQL_PASS="..."
MYSQL_DATA_DIR=/var/lib/mysql
BACKUPS_DIRECTORY=/backup/mysql/
MAX_BACKUP_CHAINS=4

A backup chain is, as usual, made up of one full backup and the subsequent incrementals. The script accepts a single argument when taking backups, either full or incr; as the names suggest, in the first case a full backup will be taken, while in the second case it will be an incremental. Backups are stored in the destination directory with the structure below:

/backup/mysql
├── full
│   ├── 2014-03-04_20-39-39
│   ├── 2014-03-09_02-00-04
│   ├── 2014-03-16_02-00-01
│   └── 2014-03-23_02-00-02
└── incr
    ├── 2014-03-04_20-39-53
    ├── 2014-03-04_20-41-21
    ├── 2014-03-05_02-00-02
    ├── 2014-03-05_13-00-02
    ├── 2014-03-06_02-00-07
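
If you want backups taken on a schedule like the one shown in the listing above (weekly fulls plus incrementals at 02:00 and 13:00), crontab entries along these lines will do – the path to the script is just where I keep the repo, so adjust it to your setup:

# full backup on Sunday nights, incrementals for the rest of the week
0 2 * * 0   /root/admin-scripts/backup/xtrabackup.sh full
0 2 * * 1-6 /root/admin-scripts/backup/xtrabackup.sh incr
0 13 * * *  /root/admin-scripts/backup/xtrabackup.sh incr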

I chose to store the incrementals separately from the full backups so as to always have full backups ready for a simple copy if needed, but restoring from incrementals works just fine. In order to restore, you can choose any of the available backups – either full or incremental. To see the list of all the available backups you can use the list argument, which shows something like this:

> admin-scripts/backup/xtrabackup.sh list
Loading configuration from /root/.xtrabackup.config.
Available backup chains (from oldest to latest):

Backup chain 1:
...

Backup chain 2:
...

Backup chain 3:
Full: 2014-03-16_02-00-01
Incremental: 2014-03-16_13-00-01
Incremental: 2014-03-17_02-00-02
...
Incremental: 2014-03-21_13-00-01
Incremental: 2014-03-22_02-00-01
Incremental: 2014-03-22_13-00-02
Backup chain 4:
Full: 2014-03-23_02-00-02
Incremental: 2014-03-23_13-00-01
Incremental: 2014-03-24_02-00-03
Incremental: 2014-03-24_13-00-01
Incremental: 2014-03-25_02-00-01
Incremental: 2014-03-25_13-00-02

Latest backup available:
Incremental: 2014-03-25_13-00-02

Then, to restore any of the backups available you can run the script with the restore argument, e.g.

admin-scripts/backup/xtrabackup.sh restore 2014-03-25_02-00-01 <destination directory>

Once the restore is complete, the final result will be a destination directory ready for use with MySQL, so all you need to do at this stage (as the script will suggest) is:

  • stop MySQL
  • replace the content of MySQL’s datadir with the contents of the destination directory you’ve used for the restore
  • ensure the MySQL datadir is owned by the mysql user
  • start MySQL again

MySQL should happily work again with the restored data.

duplicity

The other script is a wrapper which makes it a bit easier to take backups of data with duplicity; like the other script, it uses a configuration file instead of lots of options and arguments, and this configuration file is generated at ~/.duplicity.config when you first run the script with no arguments. The content of this configuration file is as follows:

INCLUDE=(/backup /etc /home /root /usr/local/configuration /var/log /var/lib/mysql /var/www)

BACKUPS_REPOSITORY="rsync://user@host//backup_destination_directory/"

MAX_FULL_BACKUPS_TO_RETAIN=4
MAX_AGE_INCREMENTALS_TO_RETAIN=1W
MAX_AGE_CHAINS_TO_RETAIN=2M
MAX_VOLUME_SIZE=250

ENCRYPTION=1
PASSPHRASE=...

# Set ENCRYPT_KEY if you want to use GPG pub key encryption. Otherwise duplicity will just use symmetric encryption.
# ENCRYPT_KEY=

# Optionally use a different key for signing
# SIGN_KEY=
# SIGN_KEY_PASSPHRASE=

COMPRESSION_LEVEL=6 # 1-9; 0 disables compression; it currently works only if encryption is enabled

VERBOSITY=4 # 0 Error, 2 Warning, 4 Notice (default), 8 Info, 9 Debug (noisiest)

# Comment out the following if you want to run one or more scripts before duplicity backup.
RUN_BEFORE=(/root/admin-scripts/backup/xtrabackup.sh)

# Comment out the following if you want to run one or more scripts after duplicity backup.
#RUN_AFTER=()

Most of these settings should be self-explanatory:

  • BACKUPS_REPOSITORY uses duplicity’s rsync backend by default, so of course you need SSH access to the destination server.
  • MAX_VOLUME_SIZE: duplicity automatically splits the backup into volumes, and the script uses settings that have duplicity generate one volume while the previous one is being asynchronously transferred to the destination, which should make backups faster. The ideal value is really difficult to determine as it depends on many things, but in my case I have found that a value of 250, combined with the other settings I use for compression and encryption, makes backups fairly fast.
  • ENCRYPTION of course enables/disables the encryption of the backup; if you are backing up on site to servers you own and that no one else controls, I’d disable this option so as to make backups quicker; otherwise I recommend enabling it if others have access to the backup files. Encryption can be done either with (GPG) keys or without keys, using symmetric encryption with a passphrase.
  • COMPRESSION_LEVEL: I’d recommend the value 6, as from my tests higher compression slows down backups for little gain. As the comment in the configuration file suggests, compression is currently available only when encryption is also enabled.

Lastly, as you can see, you can choose to run other scripts before and/or after the duplicity backup is performed. In the configuration above you can also see that I normally run the xtrabackup script first, so that the backup taken with duplicity also includes the latest MySQL backup; I find this pretty useful. As with the other script, you need to specify the full or incr argument when taking backups; this argument is automatically passed to the scripts specified in RUN_BEFORE and RUN_AFTER so, for example, when taking an incremental backup with duplicity, an incremental backup with xtrabackup is taken first.

Restoring latest backup available

Example:

duplicity -v debug rsync://user@host//backup_directory <destination>

Note: Duplicity will not overwrite an existing file.
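
If you do want to restore over existing files, duplicity needs the --force flag, for example:

duplicity -v debug --force rsync://user@host//backup_directory <destination>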

duplicity – other useful commands

Restoring from backups with duplicity is a little more straightforward than backing up, so I haven’t really added any commands for this to the script. However, I’ll add here, for reference, some useful commands you will likely need when restoring or otherwise working directly with duplicity. These examples assume you use duplicity with symmetric encryption, in which case you need to have the PASSPHRASE environment variable set and available:

export PASSPHRASE=... # the passphrase you've used in the configuration file; you'll need this with all of the following commands

If you add these commands in some other scripts, remember to unset this variable with

unset PASSPHRASE
Listing available backups
duplicity -v debug collection-status rsync://user@host//backup_directory
Listing all files in current backup
duplicity -v debug list-current-files rsync://user@host//backup_directory
Restoring by date / specific files (e.g. 3 days ago)
duplicity -v debug -t 3D --file-to-restore FILENAME rsync://user@host//backup_directory <destination>

Also:

duplicity -v debug --restore-time 1308655646 rsync://user@host//backup_directory <destination> (unix time)
duplicity -v debug --restore-time 2011-06-21T11:27:26+02:00 rsync://user@host//backup_directory <destination>

Note: the timestamps shown when listing available backups are already in the local timezone, while the time on the server is in UTC. So a backup made e.g. on 24/2/2014 at 02:00 on the server will be listed as Mon Feb 24 04:00:35 2014. Restoring this backup means using the timestamp xxxx-xx-xxT02:00:00+02:00

If you are looking to use free tools, these scripts and commands should have your backup needs on servers covered in most cases.

MySQL Cluster with Percona/Galera

Why a MySQL cluster

I have been using MySQL for many years as the RDBMS of choice for most applications; it does have its quirks, and it may lack some features I wish it had (and that other relational databases such as PostgreSQL have), but generally speaking it works fairly well and has good performance and reliability; also, I am anyway more familiar with it than with other databases. While these days the buzz is mainly for the so called NoSQL/schemaless/document-store/key-value-store/you-name-it alternatives, in my opinion relational databases are still a good option in many cases and are often also easier to use.

For a while the typical solution I used to rely on to scale MySQL databases was based on asynchronous replication, partitioning, and sharding, depending on the case. However I got tired of slaves going out of sync, and sharding can be a very good or a very bad idea depending on how it is done and on how well you can guess and plan in advance how the data will be used. In the past I’ve also tried the ‘standard’ MySQL Cluster, multi-master replication setups, and various proxying configurations; however none of these were 100% reliable or easy enough to set up and maintain. About a year ago I started playing with a different type of MySQL cluster based on the synchronous replication provided by the Galera plugin (by Codership – who are also based here in Finland); Galera enables virtually synchronous replication to allow for reading from/writing to any node, and it automatically handles node provisioning. Better performance than the ‘standard’ MySQL Cluster, no more slaves out of sync, true multi-master replication and scalability out of the box with very little maintenance. Yay!

Rather than using the plugin directly with the standard ‘distribution’ of MySQL, I prefer using Percona‘s own distribution, which includes many optimisations as well as the XtraDB storage engine, a drop-in replacement for InnoDB that performs a lot better in many scenarios; in addition, Percona XtraDB Cluster also includes the Galera plugin, so you are ready to configure a new MySQL cluster in a very short time. You can find instructions on how to set up a MySQL cluster on Percona’s website as well, but here I’d like to add slightly different instructions based on packages you can download directly, rather than on the repositories for your Linux distribution provided by Percona. The reason I prefer these packages is that in a couple of cases I have noticed that the packages available for download are newer than those you’d install from the repositories. I will also be covering some firewalling and proxying configuration, so as to have a secure and highly available MySQL cluster.

I will assume here you want to set up a complete MySQL cluster from scratch; you can skip some steps as you wish if that’s not your case. I will also assume here you already have linux boxes with at least the basic OS up to date; the following instructions will work as they are with Debian based distros (I normally use Ubuntu).

SSH setup

First things first, let’s lock down each node by configuring SSH authentication and the firewall. We need to configure public key authentication and disable the weaker password-based authentication. From your client computer, copy your public key to your new server; there are various ways to do this, but perhaps the easiest is with the ssh-copy-id utility already available with most distros (if you are on OSX and use Homebrew, you can install it with brew install ssh-copy-id). Assuming your first node is called something like node1:

ssh-copy-id -i ~/.ssh/id_dsa.pub node1

Now test the public key authentication by SSH’ing into the box; you shouldn’t be required to enter your password this time. Next, if all looks good, edit /etc/ssh/sshd_config and change the port number defined at the top of the file to the port you want to use; then uncomment the line with the setting PasswordAuthentication yes and change that setting to no, so as to force authentication with the public key, which is more secure. Now restart SSH with

service ssh restart

making sure you don’t close your current terminal session until you have successfully tested the new configuration.
Next, from your client computer, edit ~/.ssh/config and paste the following:

Host node1 # or whatever
HostName … # ip or hostname of the box
User … # user account you'll be using on the box
Port … # custom port

Replace the placeholder text with the actual IP of the server, the username you’ll be using on the box and the SSH port you’ve chosen earlier; I recommend using a different port rather than the default one (22). Try now again to SSH into the box with just

ssh node1

You should be logged in if all went OK.

Firewall with iptables

For now, we’ll lock down the box with a restrictive iptables configuration; later we’ll open some ports required for the MySQL cluster to function. Edit /etc/iptables.rules and paste the following:

# Generated by iptables-save v1.4.4 on Tue Feb 19 23:11:06 2013
*filter
:OUTPUT ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -d 127.0.0.0/8 ! -i lo -j DROP
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p tcp -m state --state NEW -m tcp --dport <your SSH port> -j ACCEPT
-A INPUT -p icmp -m icmp --icmp-type 8 -j ACCEPT
-A INPUT -j LOG
-A INPUT -j REJECT --reject-with icmp-port-unreachable
-A FORWARD -j REJECT --reject-with icmp-port-unreachable
-A OUTPUT -j ACCEPT
COMMIT
# Completed on Tue Feb 19 23:11:06 2013

this is the basic configuration I usually start with, then I open ports or make changes as required. To apply these firewall rules right away, run

iptables-restore < /etc/iptables.rules

To ensure these rules are also applied each time the server starts, edit /etc/network/interfaces and add post-up iptables-restore < /etc/iptables.rules:

auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
address ...
netmask ...
broadcast ...
network ...
gateway ...
post-up iptables-restore < /etc/iptables.rules

Of course make sure you specify your correct network settings here.

Dependencies

I install the same dependencies required when installing Percona XtraDB MySQL Cluster from source code, so as to be sure nothing is missed.

apt-get install curl build-essential flex bison automake autoconf \
bzr libtool cmake libaio-dev libncurses-dev zlib1g-dev libdbi-perl \
libnet-daemon-perl libplrpc-perl libaio1

There’s one more dependency you need, but it is only available from Percona’s repositories, so we need to add them to apt’s sources:

gpg --keyserver hkp://keys.gnupg.net --recv-keys 1C4CBDCDCD2EFD2A
gpg -a --export CD2EFD2A | sudo apt-key add -

Then edit /etc/apt/sources.list and append the following lines to enable these repositories:

deb http://repo.percona.com/apt lucid main
deb-src http://repo.percona.com/apt lucid main

Lastly, to install the remaining dependency:

apt-get update
apt-get install libmysqlclient18

Installing the Percona packages

It’s time now to install the Percona packages; you’ll need to install both packages for the Percona server and for Xtrabackup, a hot backup tool also from Percona which I cover in more detail in another post. You will need Xtrabackup if you use this tool as the strategy to use for the provisioning of nodes in the MySQL cluster – more on this later.

You can download the packages to install the Percona server from here and the one required to install Xtrabackup from here. At the moment of this writing, the latest versions available are 5.5.29-23.7.2-389 for Percona server and 2.0.5-499 for Xtrabackup. I am using Ubuntu Lucid x86-amd64 so in the following example I am downloading the packages for this version:

cd /usr/local/src

wget http://www.percona.com/redir/downloads/Percona-XtraDB-Cluster/LATEST/deb/lucid/x86_64/percona-xtradb-cluster-common-5.5_5.5.29-23.7.2-389.lucid_all.deb
wget http://www.percona.com/redir/downloads/Percona-XtraDB-Cluster/LATEST/deb/lucid/x86_64/percona-xtradb-cluster-client-5.5_5.5.29-23.7.2-389.lucid_amd64.deb
wget http://www.percona.com/redir/downloads/XtraBackup/LATEST/deb/lucid/x86_64/percona-xtrabackup_2.0.5-499.lucid_amd64.deb
wget http://www.percona.com/redir/downloads/Percona-XtraDB-Cluster/LATEST/deb/lucid/x86_64/percona-xtradb-cluster-galera-2.x_143.lucid_amd64.deb
wget http://www.percona.com/redir/downloads/Percona-XtraDB-Cluster/LATEST/deb/lucid/x86_64/percona-xtradb-cluster-server-5.5_5.5.29-23.7.2-389.lucid_amd64.deb

Then, install these packages and stop the MySQL/Percona server since we need to configure the server as the first node of the cluster.

dpkg -i percona*.deb
service mysql stop

MySQL configuration

Next, edit the MySQL configuration at /etc/mysql/my.cnf and paste the content of this gist which already includes the required configuration for the MySQL cluster nodes. An important note is in order here: the configuration in that gist is what I am currently using with a small MySQL cluster in which each node has 8 GB of ram, so you may want to tweak some settings depending on your case. I have included them as they have worked pretty well for me. You could just include the settings in the Galera synchronous replication section and you’d be fine as far as the configuration of the MySQL cluster is concerned. So it’s up to you if you want to try the other settings too.

Notes on some of the settings in the gist:

  • max-connections: this setting really depends on many things. I’ve set it to 500 but the correct value depends on how you will be using MySQL;
  • transaction-isolation: MySQL’s default setting is REPEATABLE-READ which isn’t optimal; I prefer READ-COMMITTED (which happens to be the default setting in PostgreSQL instead);
  • skip-name-resolve: prevents the server from performing a DNS lookup each time a client connects to it, speeding up connections a little bit;
  • innodb_support_xa: this is required by the Galera replication;
  • innodb_import_table_from_xtrabackup: it allows restores of single tables by replacing the tablespace even at runtime, which can be pretty handy when you don’t need to restore the whole database;
  • innodb_log_file_size: I can’t remember exactly how I determined the value of 50M but the important thing to keep in mind concerning this setting is that you won’t be able to use a datadir with InnoDB logs created with a different value (so you’d have to delete the existing logs and restart MySQL if you change the value);
  • innodb_file_per_table: this is a recommended setting for two reasons: it uses disk space better by storing the data in separate files for the various tables vs a single gigantic file that can become bloated overtime; it also allows for restores of single tables together with the previous setting;

As for the Galera synchronous replication section, you should basically use those settings as they are, apart from the following (a condensed example of this section is sketched after the list):

  • server-id: this has to be a unique id for each node; you can choose any arbitrary value;
  • wsrep_cluster_name: of course this is the name you want to give to the MySQL cluster; it’s important that all nodes in the cluster have the same value;
  • wsrep_node_name: this as well should be different for each node; I usually use names such as db1,db2,…,dbN or node1,node2,…,nodeN;
  • wsrep_slave_threads: the recommended setting is 4 threads per CPU core;
  • wsrep_cluster_address: this very important setting determines the role of a node in the MySQL cluster; as we’ll see later, this should be set to gcomm:// on the first node when bootstrapping a new cluster. Once the cluster is ready and all the nodes have been configured, it is convenient to have the setting with value gcomm://db1,db2,…,dbN on each node instead; this makes it so a node, when restarted or rebuilt, will automatically try one node per time in the list to find a node that is available and ‘synced’, so that node can become its ‘donor’ when the first node joins or rejoins the cluster;
  • wsrep_sst_method: this determines the synchronisation strategy to use when a node joins or rejoins the MySQL cluster after being offline for maintenance or else; at the moment I tend to use the rsync strategy as it seems to be somewhat more stable, but another good option is Percona’s own xtrabackup; the main difference is that with the rsync strategy both joiner and donor are seen as unavailable during the transfer of data, while with xtrabackup the donor is supposed to be available. I haven’t yet tried this though.
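
For reference, here is a condensed sketch of what the Galera section of my.cnf might contain – the values are only examples; the gist linked above is what I actually use, so double-check against that:

# Galera synchronous replication (example values)
binlog_format            = ROW
default_storage_engine   = InnoDB
innodb_autoinc_lock_mode = 2

server-id                = 1
wsrep_provider           = /usr/lib/libgalera_smm.so
wsrep_cluster_name       = "my_cluster"
wsrep_cluster_address    = gcomm://             # first node only, while bootstrapping
#wsrep_cluster_address   = gcomm://db1,db2,db3  # on every node once the cluster is up
wsrep_node_name          = db1
wsrep_slave_threads      = 16                   # roughly 4 per CPU core
wsrep_sst_method         = rsync                # or xtrabackup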

So go ahead with bootstrapping the MySQL cluster with the first node you’ve just set up, by setting wsrep_cluster_address to gcomm://. Then restart MySQL, which should now apply all the settings in the /etc/mysql/my.cnf configuration file. Before restarting though, if you have set innodb_log_file_size to some custom value, you’ll need to delete the existing InnoDB log files otherwise MySQL won’t start. MySQL’s datadir is by default /var/lib/mysql so to delete the old log files you can run:

rm /var/lib/mysql/ib_logfile*
service mysql restart

If MySQL fails to restart, try starting it “manually” with

mysqld -u mysql

which will show information that may be useful for debugging the problem. Otherwise, the first node is now ready and you can go ahead and add one node at a time – at least two more nodes for an optimal configuration.

Adding nodes to the MySQL cluster

Adding more nodes to the cluster is almost an identical process to that required to set up the very first node, apart from a few key differences. In the MySQL’s configuration for each new node, make sure that

  • server-id and wsrep_node_name have different and unique values, i.e. not in use by any other nodes in the MySQL cluster;
  • wsrep_cluster_address: it should be set to the address of the first node, or in any case of a node that is already synced and available to be used as a donor, so the joiner can receive data from it.

Having updated the MySQL configuration, stop MySQL for now on the joiner node(s), and update the firewall configuration on all nodes so that they can eventually exchange data with each other. I usually prefer using hostnames or aliases rather than IP addresses in iptables’ configuration, since it’s easier to see at a glance what each rule is for. So open /etc/hosts on each node and add entries for the IPs in use by the other nodes. For example, if I am on node1 in a 3-node MySQL cluster, I’d change the /etc/hosts file so it looks something like

127.0.0.1 localhost
127.0.1.1 node1.localdomain node1

# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

10.211.55.7 node2
10.211.55.8 node3

The two lines at the bottom are the important ones (of course make sure you specify the correct IP addresses). Next, we need to update the firewall rules. Open /etc/iptables.rules again and add the following rules before the -A INPUT -j REJECT rule:

-A INPUT -i eth0 -p tcp -m tcp --source node2 --dport 4567 -j ACCEPT
-A INPUT -i eth0 -p tcp -m tcp --source node2 --dport 4568 -j ACCEPT
-A INPUT -i eth0 -p tcp -m tcp --source node2 --dport 4444 -j ACCEPT
-A INPUT -i eth0 -p tcp -m tcp --source node2 --dport 3306 -j ACCEPT

Explanation: 4567 is the port another node will connect to in order to check whether this node is synced and available to become its donor; 4568 is used when an incremental state transfer (IST) is possible, as opposed to a snapshot state transfer (SST), which is basically a copy of all of the data – whether IST is possible or not depends on how much the data on the joiner differs from the data on the donor; 4444 is the port used with the rsync strategy, while 3306 is of course the default port at which MySQL listens for clients.

In the example above, I was on node1 so I added rules for node2. It is important to replicate those four rules for each of the other nodes in the MySQL cluster, so as to allow each node to exchange data with any of the other nodes. To apply the changes right away, run

iptables-restore < /etc/iptables.rules

Once this is done, you can start MySQL on one joiner at a time and it will start receiving data from the donor you have specified in my.cnf. Once all the nodes are up and running and synced, I recommend you set wsrep_cluster_address to gcomm://node1,node2,..nodeN. This way you don’t have to change that setting each time you take a node offline and then online again for maintenance or else, because the joiner will automatically find the first node in that list which is available to provide it with the data. If all went well, when you start a newly configured node you can see it become a joiner and receive data from the donor by watching the MySQL-related processes (e.g. you could monitor this with watch “ps waux | grep mysql”):

root 5167 0.1 0.1 19396 1952 pts/2 S+ 12:04 0:00 /bin/bash /etc/init.d/mysql start
root 5195 0.1 0.0 4108 728 pts/2 S+ 12:04 0:00 /bin/sh /usr/bin/mysqld_safe
mysql 5837 0.5 3.3 245612 33980 pts/2 Sl+ 12:04 0:00 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --user=mysql --log-error=/var/lib/mysql/node3.err --pid-file=/var/lib/mysql/node3
mysql 5884 0.0 0.0 4108 612 pts/2 S+ 12:04 0:00 sh -c wsrep_sst_rsync --role 'joiner' --address '10.211.55.8' --auth '' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --parent '5837'
mysql 5886 0.2 0.1 19248 1764 pts/2 S+ 12:04 0:00 /bin/bash -ue /usr//bin/wsrep_sst_rsync --role joiner --address 10.211.55.8 --auth --datadir /var/lib/mysql/ --defaults-file /etc/mysql/my.cnf --parent 5837
mysql 5909 0.0 0.0 10984 676 ? Ss 12:04 0:00 rsync --daemon --port 4444 --config /var/lib/mysql//rsync_sst.conf

In the example above I was using the rsync strategy; the output would look slightly different if you used the xtrabackup strategy instead. This is instead what you would see on the donor while SST is happening:

root 746 0.0 0.0 4108 688 ? S 11:38 0:00 /bin/sh /usr/bin/mysqld_safe
mysql 1448 0.1 10.6 1118380 108624 ? Sl 11:38 0:03 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --user=mysql --log-error=/var/lib/mysql/node2.err --pid-file=/var/lib/mysql/node2
mysql 6938 0.0 0.0 4108 616 ? S 12:22 0:00 sh -c wsrep_sst_rsync --role 'donor' --address '10.211.55.8:4444/rsync_sst' --auth '(null)' --socket '/var/run/mysqld/mysqld.sock' --datadir '/var/lib/mysql/' --defaults-fil
mysql 6939 1.0 0.1 17732 1592 ? S 12:22 0:00 /bin/bash -ue /usr//bin/wsrep_sst_rsync --role donor --address 10.211.55.8:4444/rsync_sst --auth (null) --socket /var/run/mysqld/mysqld.sock --datadir /var/lib/mysql/ --defa
mysql 6949 33.0 0.1 21112 1636 ? R 12:22 0:00 rsync --archive --no-times --ignore-times --inplace --delete --quiet --whole-file -f + /ib_lru_dump -f + /ibdata* -f + /ib_logfile* -f + */ -f -! */* /var/lib/mysql/ rsync:/

Once you have configured all the nodes, your shiny new MySQL cluster is ready to be used as is – yay! – but in order to take full advantage of it you will need to split reads and writes either in your application or with a load balancer such as haproxy, which I cover next.

Load balancing and failover

Once you have two or (better) more nodes in the MySQL cluster, you could already use it as is and split connections or reads/writes at application level; however it’s perhaps easiest to use something like haproxy that will handle this for you and will also ensure nodes that are not in sync are ignored. Setting this up is quite easy; first, unless you have haproxy already installed, you can install it with

apt-get install haproxy

Next, edit /etc/haproxy/haproxy.cfg and paste the following lines:

listen mysql-cluster 0.0.0.0:3306
mode tcp
balance leastconn
option tcpka
option httpchk

server db1 node1:3306 check port 9200 inter 5000 rise 3 fall 3 maxconn 400
server db2 node2:3306 check port 9200 inter 5000 rise 3 fall 3 maxconn 400
server db3 node3:3306 check port 9200 inter 5000 rise 3 fall 3 maxconn 400

Ensure you have listed all the nodes; do not restart haproxy yet. First, we need to configure a service on each node that haproxy will use to monitor the nodes and automatically ignore nodes that are offline or not in sync with the rest of the MySQL cluster. This is typically done with xinetd, although there are certainly other ways to achieve the same result. If you don’t have xinetd installed yet, run apt-get install xinetd, then create the new file /etc/xinetd.d/mysqlchk, if it doesn’t exist yet (it appears that this configuration is now done automatically in the newest version of Percona MySQL cluster) and paste the following:

# default: on
# description: mysqlchk
service mysqlchk
{
# this is a config for xinetd, place it in /etc/xinetd.d/
disable = no
flags = REUSE
socket_type = stream
port = 9200
wait = no
user = nobody
server = /usr/bin/clustercheck
log_on_failure += USERID
only_from = 0.0.0.0/0
# recommended to put the IPs that need
# to connect exclusively (security purposes)
per_source = UNLIMITED
}

Next, edit /usr/bin/clustercheck, which may or may not already exist depending on the version of the MySQL cluster you have set up. If the file exists, just ensure that the variables MYSQL_USERNAME and MYSQL_PASSWORD are set to the correct MySQL credentials. If the file doesn’t exist, create it and paste the following:

#!/bin/bash
MYSQL_USERNAME="..."
MYSQL_PASSWORD="..."
ERR_FILE="/dev/null"
AVAILABLE_WHEN_DONOR=0

#
# Perform the query to check the wsrep_local_state
#
WSREP_STATUS=`mysql --user=${MYSQL_USERNAME} --password=${MYSQL_PASSWORD} -e "SHOW STATUS LIKE 'wsrep_local_state';" 2>${ERR_FILE} | awk '{if (NR!=1){print $2}}' 2>${ERR_FILE}`

if [[ "${WSREP_STATUS}" == "4" ]] || [[ "${WSREP_STATUS}" == "2" && ${AVAILABLE_WHEN_DONOR} == 1 ]]
then
# Percona XtraDB Cluster node local state is 'Synced' => return HTTP 200
/bin/echo -en "HTTP/1.1 200 OK\r\n"
/bin/echo -en "Content-Type: text/plain\r\n"
/bin/echo -en "\r\n"
/bin/echo -en "Percona XtraDB Cluster Node is synced.\r\n"
/bin/echo -en "\r\n"
exit 0
else
# Percona XtraDB Cluster node local state is not 'Synced' => return HTTP 503
/bin/echo -en "HTTP/1.1 503 Service Unavailable\r\n"
/bin/echo -en "Content-Type: text/plain\r\n"
/bin/echo -en "\r\n"
/bin/echo -en "Percona XtraDB Cluster Node is not synced.\r\n"
/bin/echo -en "\r\n"
exit 1
fi

If you run /usr/bin/clustercheck manually on an active, synced node, you’ll see the following output as expected:

HTTP/1.1 200 OK
Content-Type: text/plain

Percona XtraDB Cluster Node is synced.

Now restart xinetd with /etc/init.d/xinetd restart and then test that the script can also be run via the port specified in the xinetd configuration (9200):

root@node1:~# telnet localhost 9200
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
HTTP/1.1 200 OK
Content-Type: text/plain

Percona XtraDB Cluster Node is synced.

Connection closed by foreign host.

Now you can reload haproxy as well with

service haproxy reload

and ensure your applications connect to the load balancer instead of connecting to any of the MySQL cluster's nodes directly. One last thing I'd like to suggest, which I find very useful, is to use haproxy's web interface to check the status of the nodes, especially when you take a node offline for maintenance and want to verify that it rejoins the cluster correctly afterwards. Edit /etc/haproxy/haproxy.cfg again and add the following lines (use a good combination of username and password, and optionally a custom port):

listen stats 0.0.0.0:8282
mode http
option httpchk
balance roundrobin
stats uri /
stats refresh 10s
stats realm Haproxy\ Statistics
stats auth username:password

Once you reload haproxy again, you will be able to see the status of the MySQL cluster’s nodes from the UI at the port specified (8282 or whichever you have chosen):

(Screenshot: haproxy's statistics web UI showing the status of the MySQL cluster's nodes.)
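One more optional tweak before testing the cluster: if you'd rather have haproxy itself separate writes from reads instead of doing it in the application, a pattern commonly used with Galera-based clusters is to define two listeners: one that always sends writes to a single node (with the others acting as backups, which helps avoid conflicts from writing to several nodes at once), and one on a different port that balances reads across all nodes. A rough sketch, which would replace the single mysql-cluster listener shown earlier (node names and ports are just examples):

listen mysql-cluster-writes 0.0.0.0:3306
mode tcp
option tcpka
option httpchk

server db1 node1:3306 check port 9200 inter 5000 rise 3 fall 3 maxconn 400
server db2 node2:3306 check port 9200 inter 5000 rise 3 fall 3 maxconn 400 backup
server db3 node3:3306 check port 9200 inter 5000 rise 3 fall 3 maxconn 400 backup

listen mysql-cluster-reads 0.0.0.0:3307
mode tcp
balance leastconn
option tcpka
option httpchk

server db1 node1:3306 check port 9200 inter 5000 rise 3 fall 3 maxconn 400
server db2 node2:3306 check port 9200 inter 5000 rise 3 fall 3 maxconn 400
server db3 node3:3306 check port 9200 inter 5000 rise 3 fall 3 maxconn 400

With this in place, your application would send writes to port 3306 and reads to port 3307.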

Testing

Testing the MySQL cluster is quite easy: just take a node offline, or kill -9 MySQL and delete all the data on a node, and see what happens when you restart MySQL :p

Conclusion

I think that although both Galera and Percona XtraDB Cluster are relatively new, this combination is definitely the best setup I have worked with so far for MySQL databases; it's nice to have the peace of mind that nodes can be taken offline at any time for maintenance and then resynced automatically without downtime, while at the same time scaling reads and, to some extent, writes too. I will certainly play again with alternatives such as MongoDB and similar, but I have been using a MySQL cluster with Percona/Galera in production for a while now and it's been pretty stable and required very little maintenance, so for the time being I will stick with MySQL rather than rethinking the applications I am working on to adapt them to other solutions. I will, however, very likely look into similar clustering solutions for PostgreSQL, since I am getting more and more interested in that database these days.

I would be interested to hear others’ experiences with a MySQL cluster with Percona/Galera or any alternatives that have worked well for them.

Recovering from MySQL replication stops caused by failing statements

Sometimes it happens that MySQL replication stops because some statements received from the master fail when replayed on a slave. In such cases, depending on the nature of the failure, you may need to rebuild the slave from scratch using a backup from the master (I'd recommend Percona's XtraBackup for this, as explained in an earlier post), or try to re-sync the slave using something like pt-table-sync, again from Percona (these tools were originally known as Maatkit).
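For reference, if you do go the pt-table-sync route, a minimal invocation looks roughly like this (host name and credentials are placeholders; --print only shows the statements it would execute, and you'd switch to --execute once you're happy with them):

pt-table-sync --print --sync-to-master h=slave1.example.com,u=syncuser,p=syncpassword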

Luckily, this isn't always necessary: sometimes the failing statements don't affect the data at all, or only affect data that isn't important or is purged frequently (for example, tables containing temporary data). In such cases you may not need to rebuild the slave or re-sync it with the Percona tools, which perform a time consuming comparison (especially on large tables) and can cause high load on both master and slave, even leading to locks and other issues depending on how busy the servers are and the resources available. Instead, you can get the slave up and running again very quickly by simply skipping the failing statements, provided that these statements, and any data they might affect, can be safely ignored.

This can be done by temporarily stopping the slave, then instructing MySQL to skip one or more statements received from the master, and finally starting the slave again. For example, let's assume we find that the replication has stopped due to some statement causing errors:

mysql> show slave status \G
*************************** 1. row ***************************
...
Slave_IO_Running: Yes
Slave_SQL_Running: No
...
Last_Errno: (some error code)
Last_Error: (description)
...
Seconds_Behind_Master: NULL
...
Last_SQL_Errno: (some error code)
Last_SQL_Error: Error ... (description)
...
1 row in set (0.00 sec)

Note the Slave_SQL_Running status, the error details, and Seconds_Behind_Master.

To fix this, you'd stop the slave, tell MySQL to skip at least the next statement (which we know for sure is failing) by setting the global variable SQL_SLAVE_SKIP_COUNTER, and then start the slave again:

mysql> stop slave;
Query OK, 0 rows affected (0.10 sec)

mysql> set global SQL_SLAVE_SKIP_COUNTER = 1;
Query OK, 0 rows affected (0.00 sec)

mysql> start slave;
Query OK, 0 rows affected (0.00 sec)

Assuming, as in the example, that only one statement is failing, by skipping that single statement the slave can happily start to replicate again:

mysql> show slave status \G
*************************** 1. row ***************************
...
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
...
Last_Errno: 0
Last_Error:
...
Seconds_Behind_Master: 5341
...
Last_SQL_Errno: 0
Last_SQL_Error:
...
1 row in set (0.00 sec)

Once this is done, you only need to wait for the slave to catch up with the master; how long that takes depends on how long the replication has been stopped.
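To keep an eye on the catch-up without re-running the query by hand, something as simple as this works (assuming your credentials are in ~/.my.cnf, otherwise add -u and -p as usual):

watch -n 10 'mysql -e "show slave status \G" | grep Seconds_Behind_Master'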

If only one statement were failing, you'd be lucky. At times there can be many more failing statements, depending on the replicated databases and their usage. In these cases you could still skip one statement at a time until the replication is fine again, but you have no way of knowing how many of the next statements received from the master will fail. Another option is to set SQL_SLAVE_SKIP_COUNTER to a value greater than 1, so that MySQL will skip multiple statements at once; but since you don't know how many statements will fail, there's the risk that statements which would not have failed are also skipped by the slave, affecting important data.

You can avoid this by still skipping one failing statement at a time, but rather than doing it manually you can automate it with some very simple shell commands. For starters, let's save the credentials to connect to the MySQL instance in a couple of variables:

MYSQL_USER=...
MYSQL_PASSWORD=...

Next, let's save, as an alias, a simple command line that tells us whether the slave is currently running:

alias slave-check='mysql -u$MYSQL_USER -p$MYSQL_PASSWORD -e "show slave status \G" | \
grep "Slave_SQL_Running: Yes" | \
wc -l'

So now we can just run the command slave-check and that’ll return either 1 or 0 depending on whether the replication is running or not. Similarly, we can save with another alias a command line that tells MySQL to stop the slave, skip one statement and restart the slave:

alias skip-one='mysql -u$MYSQL_USER -p$MYSQL_PASSWORD \
-e "set global SQL_SLAVE_SKIP_COUNTER = 1; \
stop slave; \
start slave; \
select sleep(1);"'

So we can just run skip-one to automate this step in a very simple way. Note that I've added a select sleep(1) query too, since sometimes the slave may appear to be running for a brief moment after skipping one statement, even though it stops again soon after because of other failing statements.

Finally, we can save a third command line that keeps skipping failing statements, one at a time, until the replication is running again:

alias skip-failing-statements='while [ $(slave-check) = 0 ]; do skip-one; done'

If you save these simple commands in your shell profile, for example, you can recover a stopped slave (when possible) by simply running skip-failing-statements. That's it. It's a stupidly simple tip, but it might be useful to someone who is new to MySQL replication or isn't too familiar with the command line.
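If you prefer a small standalone script over shell aliases (aliases can be fiddly with quoting and command substitution), here's a minimal sketch of the same loop; the file name and credentials are placeholders:

#!/bin/bash
# skip-failing-statements.sh: keep skipping failing replicated statements,
# one at a time, until the slave's SQL thread is running again.
MYSQL_USER="..."
MYSQL_PASSWORD="..."

slave_running() {
  mysql -u"$MYSQL_USER" -p"$MYSQL_PASSWORD" -e "show slave status \G" | grep -c "Slave_SQL_Running: Yes"
}

while [ "$(slave_running)" -eq 0 ]; do
  mysql -u"$MYSQL_USER" -p"$MYSQL_PASSWORD" -e "set global SQL_SLAVE_SKIP_COUNTER = 1; stop slave; start slave; select sleep(1);"
done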

Resque: automatically kill stuck workers and retry failed jobs

Resque is a great piece of software by Github that makes it really easy to perform some operations ('jobs') asynchronously and in a distributed way across any number of workers. It's written in Ruby and backed by the uber cool Redis key-value data store, so it's efficient and scalable. I've been using Resque in production for a couple of years now, after it replaced Delayed Job in my projects, and love it. If your projects do something that could be done asynchronously, you really should check it out if you haven't yet.

At OnApp we've been using Resque for a while to process background jobs of various types, with great results: in a few months we've processed a little over 160 million jobs (at the moment of this writing), and of these only 43K jobs have been counted as failed so far. However, many of those failed jobs were retried successfully at a subsequent attempt, so the number of jobs that actually failed is a lot smaller, perhaps a few thousand.

Out of 160M+ jobs, that's a very small percentage of failures. But although the system has been rock solid for the most part, jobs can still fail every now and then depending on the nature of the jobs, excessive load on the worker servers, temporary networking and timeout issues, or design related issues such as race conditions and the like. Sometimes you will also find that workers get "stuck", which (usually) requires manual intervention (as in: kill / restart the workers, manually sort out the failed jobs).

So I wanted to share a simple script I am using in production to automatically find and kill these "stuck" workers and then retry any jobs that are flagged as failed because the workers were killed, or for any other reason. The purpose is to keep workers running and minimise the need for manual intervention when something goes wrong.

Please note that I use resque-pool to manage a pool of workers more efficiently on each worker server. Therefore if you manage your workers in a different way, you may need to adapt the script to your configuration.

You can find the little script in this gist, but I’ll briefly explain here how it works. It’s very simple, really. First, the script looks for the processes that are actually working off jobs:

root@worker1:/scripts# ps -eo pid,command | grep [r]esque
10088 resque-pool-master: managing [10097, 10100, 10107, 10113, 10117, 10123, 10138, 10160, 10167, 10182, 10195]
10097 resque-1.20.0: Forked 16097 at 1337878130
10100 resque-1.20.0: Forked 16154 at 1337878131
10107 resque-1.20.0: Waiting for cdn_transactions_collection
10113 resque-1.20.0: Waiting for usage_data_collection
10117 resque-1.20.0: Waiting for usage_data_collection
10123 resque-1.20.0: Waiting for check_client_balance
10138 resque-1.20.0: Waiting for check_client_balance
10160 resque-1.20.0: Waiting for geo_location
10167 resque-1.20.0: Forked 16160 at 1337878131
10182 resque-1.20.0: Forked 16163 at 1337878132
10195 resque-1.20.0: Waiting for services_coordination
16097 resque-1.20.0: Processing push_notifications since 1337878130
16163 resque-1.20.0: Processing push_notifications since 1337878132

This is an example from one of our worker servers. The Processing processes are those that are actually working off jobs, so these are the ones we are after, since these are the processes that can sometimes get "stuck" for one reason or another. So the script first looks for these processes only, ignoring the rest:

root@worker1:/scripts# ps -eo pid,command | grep [r]esque | grep Processing
18956 resque-1.20.0: Processing push_notifications since 1337878334
19034 resque-1.20.0: Processing push_notifications since 1337878337
19052 resque-1.20.0: Processing usage_data_collection since 1337878338
19061 resque-1.20.0: Processing usage_data_collection since 1337878338
19064 resque-1.20.0: Processing usage_data_collection since 1337878339
19066 resque-1.20.0: Processing usage_data_collection since 1337878339

Next, the script loops through these processes, and looks for those that have been running for over 50 seconds. You may want to change this threshold, but in our case all jobs should usually complete in a few seconds, so if some jobs are still found after almost a minute, something is definitely going on.

ps -eo pid,command |
grep [r]esque |
grep "Processing" |
while read PID COMMAND; do
if [[ -d /proc/$PID ]]; then
SECONDS=`expr $(awk -F. '{print $1}' /proc/uptime) - $(expr $(awk '{print $22}' /proc/${PID}/stat) / 100)`

if [ $SECONDS -gt 50 ]; then
kill -9 $PID
...

QUEUE=`echo "$COMMAND" | cut -d ' ' -f 3`

echo "
The forked child with pid #$PID (queue: $QUEUE) was found stuck for longer than 50 seconds.
It has now been killed and job(s) flagged as failed as a result have been re-enqueued.

You may still want to check the Resque Web UI and the status of the workers for problems.
" | mail -s "Killed stuck Resque job on $(hostname) PID $PID" email@address.com

...
fi
fi
done

I was looking for a nice and easy way to find out how long (in seconds) a process had been running, and the expression you see in the code snippet above was the nicest solution I could find (hat tip to joseph for this).

If any of the Resque processes that are working off jobs are found running for longer than 50 seconds, then these are killed without mercy and a notification is sent to some email address just in case.

First, this way we don't actually kill the Resque workers themselves, but rather the processes forked by the workers to process jobs. This means that the workers remain up and running, and soon after they'll fork new processes to work off other jobs from the queue(s) they are watching. This is the nicest part, in that you don't need to manually kill the actual workers and then restart them to keep the worker servers going.

Second, killing those processes will cause the jobs that they were processing to fail, so they will appear in Resque's "failed jobs" queue. The second part of the script takes care of this by running a rake task that re-enqueues all failed jobs and clears the failed jobs queue. For starters, you'll need to add this rake task to your application. If you are already using Resque, you will likely have a lib/tasks/resque.rake file; otherwise you'll have to create one (I'm assuming here that it's a Rails application).

In any case, add the following task to that rake file:

desc "Retries the failed jobs and clears the current failed jobs queue at the same time"
task "resque:retry-failed-jobs" => :environment do
(Resque::Failure.count-1).downto(0).each { |i| Resque::Failure.requeue(i) }; Resque::Failure.clear
end

Back to the script: if it finds and kills any stuck processes, it then runs the above rake task so as to retry the failed jobs:

ps -eo pid,command |
grep [r]esque |
grep "Processing" |
while read PID COMMAND; do
if [[ -d /proc/$PID ]]; then
SECONDS=`expr $(awk -F. '{print $1}' /proc/uptime) - $(expr $(awk '{print $22}' /proc/${PID}/stat) / 100)`

if [ $SECONDS -gt 50 ]; then
...
touch /tmp/retry-failed-resque-jobs
...
fi
fi
done

if [[ -f /tmp/retry-failed-resque-jobs ]]; then
/bin/bash -c 'export rvm_path=/usr/local/rvm && export HOME=/home/deploy && . $rvm_path/scripts/rvm && cd /var/www/sites/dashboard/current/ && /usr/local/bin/rvm rvmrc load && RAILS_ENV=production bundle exec rake resque:retry-failed-jobs'
fi

You may notice that I am forcing the loading of RVM before running the rake task; this is because I need to upgrade some stuff on the worker servers, but you may not need to run the rake task this way.

This is basically it: the script just kills the stuck workers and retries the failed jobs without requiring manual intervention; in almost all cases, I don't have to worry about them anymore, besides wondering whether there's a design issue that might cause workers to get stuck and that therefore needs to be addressed (which is a good reason to keep an eye on the notifications). There might be other monitoring solutions of various types out there, but this simple script is what has been working best for me so far on multiple worker servers with tens of workers.

The final step is to ensure that this script runs frequently, so that problems are fixed as soon as they arise. The script is extremely lightweight, so in my case I just schedule it (with cron) to run every minute on each worker server.
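For example, assuming you saved the script as /scripts/kill_stuck_resque_workers.sh (the path and file name here are just an example), the crontab entry would look like this:

* * * * * /scripts/kill_stuck_resque_workers.sh > /dev/null 2>&1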

Know of a better way of achieving the same result? Please do let me know in the comments.

Painless, ultra fast hot backups and restores of MySQL databases with Percona’s XtraBackup

A better alternative to mysqldump?

In the previous post, we’ve seen how to perform backups and restores of MySQL databases using the mysqldump tool bundled with the MySQL server distribution, together with a few tricks that help speed up both backups and restores, among other things.

mysqldump is very likely the most popular tool for backing up MySQL databases, and in most cases it just works, and pretty well too. It’s easy to use, free, comes with MySQL and produces simple text files containing standard DROP/CREATE/INSERT SQL statements that, once replayed during a restore, will simply recreate the databases, the tables and all the data on the destination MySQL server. By producing simple text files based on SQL statements, mysqldump may also be the perfect tool when migrating data between different versions of MySQL, or when you need to change something in the text file dump before restoring it.

However, mysqldump may not be up to the job if you need to back up MySQL databases that you want to keep active without any downtime, while they are in use for writes as well as reads; for example, when the databases are also replicated and you need to copy them to another server instance to set up a new slave or resync an existing one. In these circumstances, if you use mysqldump anyway you may have to choose between locking all the tables while backing up (thus blocking writes in the meantime, which in some cases may mean breaking applications) and having dumps that may contain inconsistent data. But you are unlikely to want your users to experience downtime, and in any case you don't want your databases to change while you are exporting the data, do you?

Another issue is performance. While mysqldump works great with small databases, it is not efficient with large ones, and restoring a SQL dump of a large database can take a very, very long time. With the tricks we've seen in the previous post it is possible to speed up both dumps and restores significantly, but still this would not work well for very large databases. So, unless you are very patient, don't mind waiting and can afford to sit for a week watching your MySQL server while it restores your large db, sooner or later you will have to look for other, more efficient options.

LVM snapshots

Among the other options, there are both commercial and free tools. One popular (free) choice is to use LVM snapshots; they work well when you can use them (for example, in many cases you may not be able to use LVM snapshots with virtual private servers, unless you are given the possibility to customise the disk layout), but they've also got their weaknesses; we'll look at this option in more detail in a future post, but suffice it to say here that LVM snapshots can have a significant impact on the disks, and therefore on the performance of the whole server. Plus, LVM snapshots only capture what MySQL has already flushed to disk at that moment, not the data it may still have in memory, so restoring from an LVM snapshot may require a crash recovery.

Hot backups!

A better alternative yet is MySQL Enterprise Backup (formerly InnoDB Hot Backup), which can back up a live MySQL server without downtime and at the same time ensure that our backups are consistent even if the databases are also being written to during the process. Plus, restores are incredibly fast. Sounds great, yeah? Unfortunately, MySQL Enterprise Backup is a pricey commercial product that may be beyond reach in many cases (5K USD per server), especially for companies with tight budgets that may have chosen MySQL for its cost (none) in the first place, as well as for its good performance.

But fear not, if you are on the cheap like me: luckily for us, there also exists a free, open source solution by the good people at Percona (a MySQL consultancy company), which is very, very similar to Enterprise Backup/InnoDB Hot Backup. It's called XtraBackup, and while it's still pretty young, having been first developed less than a couple of years ago, it's a pretty solid solution that works very well despite some limitations. It also compares well to the commercial alternative in that it only takes a bit longer to perform backups of the same data and requires a little more disk space (I haven't had a chance to test this myself yet, so I'll trust what I've read on pythian.com: Taste test: Innobackup vs. Xtrabackup).

Here are the main features available with the current release, from the home page:

  • Create hot InnoDB backups without pausing your database
  • Make incremental backups of MySQL
  • Stream compressed MySQL backups to another server
  • Move tables between MySQL servers online
  • Create new MySQL replication slaves easily
  • Backup MySQL without adding load to the server

While the product itself is free, Percona also provides commercial support, if needed. So, let’s see now how to use Xtrabackup for backing up and restoring MySQL databases.

Installing

For starters, you'll need to install XtraBackup. In this example, we'll install the Debian package on an Ubuntu server. Head to XtraBackup's download page, or download the latest version available (1.4 at the moment of this writing) with wget or similar, making sure you choose the right package for your OS.

wget https://www.percona.com/downloads/XtraBackup/XtraBackup-1.4/deb/lucid/x86_64/xtrabackup_1.4-193.lucid.25_amd64.deb
dpkg -i xtrabackup_1.4-193.lucid.25_amd64.deb

Backing up

The package installs a binary named xtrabackup that takes care of the actual backups and restores, but it also installs a perl script named innobackupex-1.5.1 that makes XtraBackup a bit easier to use, plus it also backs up MyISAM tables. Backing up is pretty simple:

innobackupex-1.5.1 --user=... --password=... --slave-info \
--defaults-file=/etc/mysql/my.cnf --databases="db1 db2" --stream=tar ./ \
| gzip -c -1 > /backups/backup.`date +%Y-%m-%d-%H-%M-%S`.tar.gz

You should see an output similar to this:

This software is published under
the GNU GENERAL PUBLIC LICENSE Version 2, June 1991.
IMPORTANT: Please check that the backup run completes successfully.
At the end of a successful backup run innobackupex-1.5.1
prints "completed OK!".
innobackupex-1.5.1: Using mysql Ver 14.14 Distrib 5.1.41, for debian-linux-gnu (i486)
using readline 6.1
innobackupex-1.5.1: Using mysql server version Copyright 2000-2008 MySQL, 2008 Sun
Microsystems, Inc.
innobackupex-1.5.1: Created backup directory /backups
101214 21:47:59 innobackupex-1.5.1: Starting mysql with options:
--defaults-file="/etc/mysql/my.cnf" --password=... --user=... --unbuffered --
101214 21:47:59 innobackupex-1.5.1: Connected to database with mysql child process
(pid=17917)
101214 21:48:03 innobackupex-1.5.1: Connection to database server closed
101214 21:48:03 innobackupex-1.5.1: Starting ibbackup with command: xtrabackup
--defaults-file="/etc/mysql/my.cnf" --backup --suspend-at-end --log-stream
--target-dir=./
innobackupex-1.5.1: Waiting for ibbackup (pid=17944) to suspend
innobackupex-1.5.1: Suspend file '/var/lib/mysql/xtrabackup_suspended'
xtrabackup: suspend-at-end is enabled.
xtrabackup: uses posix_fadvise().
xtrabackup: cd to /var/lib/mysql
xtrabackup: Target instance is assumed as followings.
xtrabackup: innodb_data_home_dir = ./
xtrabackup: innodb_data_file_path = ibdata1:10M:autoextend
xtrabackup: innodb_log_group_home_dir = ./
xtrabackup: innodb_log_files_in_group = 2
xtrabackup: innodb_log_file_size = 5242880
xtrabackup: Stream mode.
>> log scanned up to (92295311910)


xtrabackup: Transaction log of lsn (92294278211) to (92298258089) was copied.
101214 22:47:23 innobackupex-1.5.1: All tables unlocked
101214 22:47:23 innobackupex-1.5.1: Connection to database server closed
innobackupex-1.5.1: Backup created in directory '/backups'
innobackupex-1.5.1: MySQL binlog position: filename 'mysql-bin.000730',
position 16838362
innobackupex-1.5.1: MySQL slave binlog position: master host '192.168.162.110',
filename 'mysql-bin.000651', position 99728693
innobackupex-1.5.1: You must use -i (--ignore-zeros) option for extraction
of the tar stream.
101214 22:47:23 innobackupex-1.5.1: completed OK!

In this first example, we have instructed XtraBackup to stream the backup data in tar format; the output is then compressed with gzip into a file with the current timestamp in its name. You may have noticed that I am using gzip's lowest compression level: higher levels only require more CPU and resources in general, with little improvement, especially if you pipe the compression to XtraBackup while the backup is still in progress.

In fact, I have often had problems compressing the backup data while the backup is still running; in many cases I found that backing up first, and then compressing the backup once it is done, works better. In my experience, whether compressing and backing up at the same time succeeds also depends on the value of the wait_timeout system variable in MySQL. On my servers, whenever I have set this variable to a low value I've had problems getting backup and compression to work in one step. There may be better settings or workarounds, but I've found that setting wait_timeout to a value higher than the average time taken by the backup process usually makes for a successful compressed backup without requiring a separate compression step. Keep in mind, though, that a high value for wait_timeout can cause a whole lot of other issues depending on the usage pattern of your MySQL instances.

For this reason, you may want to keep wait_timeout at a not-too-high value, and only change it for the XtraBackup session. You can easily do this by patching the innobackupex-1.5.1 script as suggested by reader "Gu Lei" in a bug somebody filed for exactly this issue. So edit the file /usr/bin/innobackupex-1.5.1 and change it as follows:

if (compare_versions($mysql_server_version, '4.0.22') == 0
|| compare_versions($mysql_server_version, '4.1.7') == 0) {
# MySQL server version is 4.0.22 or 4.1.7
mysql_send "COMMIT;";
mysql_send "set interactive_timeout=28800;"; # ADD THIS
mysql_send "set wait_timeout=28800;"; # ADD THIS
mysql_send "FLUSH TABLES WITH READ LOCK;";
} else {
# MySQL server version is other than 4.0.22 or 4.1.7
mysql_send "set interactive_timeout=28800;"; # ADD THIS
mysql_send "set wait_timeout=28800;"; # ADD THIS
mysql_send "FLUSH TABLES WITH READ LOCK;";
mysql_send "COMMIT;";
}

Of course, the timeout value must be high enough to cover the amount of time XtraBackup could possibly take to back up your data.
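If you'd rather not touch wait_timeout or patch the innobackupex-1.5.1 script at all, the two-step approach mentioned earlier (back up first, compress afterwards) would look more or less like this; paths and file names are just examples:

# Step 1: stream the backup to an uncompressed tar archive
BACKUP_FILE=/backups/backup.`date +%Y-%m-%d-%H-%M-%S`.tar
innobackupex-1.5.1 --user=... --password=... --slave-info \
--defaults-file=/etc/mysql/my.cnf --databases="db1 db2" --stream=tar ./ > $BACKUP_FILE

# Step 2: compress the archive once the backup has completed
gzip -1 $BACKUP_FILE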

The arguments --slave-info and --databases are optional. The former saves the master status info together with the data (assuming the server you are backing up from is a master or an existing slave in a replicated environment), in case you are setting up a new slave or resyncing an existing one. The latter allows you to specify which databases you want to back up, if you don't want to back up all the available databases; it accepts a space separated list of database names. Obviously, if your MySQL configuration file is in another location on your server, remember to change the path in the relevant argument accordingly.

Streaming backups to another MySQL server

The command above works pretty well if you want to make backups manually or if you want to schedule, for example, daily full backups with cron. However, if you are backing up with the purpose of restoring to another server, you may want to skip creating a local file that then needs to be transferred to the destination server for the restore. You can instead stream XtraBackup's output directly to the destination server, and create a backup archive there, by using netcat.

Netcat, which we already met in the previous post on backing up with mysqldump, is a simple yet very useful utility that lets you stream data over a TCP connection and, for example, save that data to a file on the destination host. First, let's get netcat running on the destination host, listening on a port of our choice (for example 6789) and redirecting its output to a file on disk:

nc -l 6789 > /backups/master-data.tar

Then, on the server that hosts the data we want to migrate:

innobackupex-1.5.1 --user=... --password=... --slave-info \
--defaults-file=/etc/mysql/my.cnf --databases="db1 db2" --stream=tar ./ \
| nc xxx.xxx.xxx.xxx 6789

where xxx.xxx.xxx.xxx is the IP address of the destination host. XtraBackup will take its time depending on the amount of data to transfer, and once the process is completed netcat will be automatically closed on the destination host, with all the data copied into the target file. If all went well, XtraBackup will complete the process on the source host saying:

innobackupex-1.5.1: You must use -i (--ignore-zeros) option
for extraction of the tar stream.
101214 22:47:23 innobackupex-1.5.1: completed OK!

If not, well, you are going to have trouble, of course.

Restoring

If backing up with XtraBackup was easy, restoring isn't any more difficult; it just requires a few additional steps rather than a single one. First, on the destination host (we are done on the source host), back up the existing MySQL data folder if needed, and then delete or rename it, so that we can create a new, empty data folder in the same location which will contain the restored data.

service mysql stop # you need to shut down MySQL before restoring
mv /var/lib/mysql /var/lib/mysql.bak && mkdir /var/lib/mysql

Next, extract the tar archive created while streaming the data during the backup into the new MySQL data folder. As XtraBackup reminds us at the end of the backup process, we need to use the --ignore-zeros option (or -i) when untarring. Also remember to add the -z option if you are restoring from a gzipped tarball; here, for example's sake, I am assuming you have used netcat instead.

cd /var/lib/mysql && tar xfvi /backups/master-data.tar
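If instead you are restoring from one of the gzipped tarballs produced in the first backup example, just add the -z option as mentioned; for example (adjust the archive name to whatever timestamped file you actually have):

cd /var/lib/mysql && tar xzvfi /backups/backup.2010-12-14-21-47-59.tar.gz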

Unless something is wrong with the archive for some reason, you should now have a new MySQL data folder identical to the one on the source host at the moment the backup was taken. One thing though: this data folder is not yet ready for use with MySQL; it needs to be "prepared" first, to apply the changes that XtraBackup captured in the transaction logs during the backup. Once this is done, fix the permissions and start MySQL again:

cd /var/lib/mysql && innobackupex-1.5.1 --apply-log --ibbackup=xtrabackup_51 \
./ && chown -R mysql:mysql /var/lib/mysql

service mysql start

If all went well so far, MySQL should start correctly with the restored database(s).

Setting up the new/updated instance as replication slave

Assuming you now want to use this new server as a slave of the instance you backed the data from, and that you know how to set up the replication (we'll look at how to do this in another post), or that this instance was already a slave and you just need to resync it, you will have to update the information that MySQL uses to connect to the master and replicate data. If you used the option --slave-info with XtraBackup during the backup, like in my example, in the new data folder you should also see some additional files that contain exactly this information.

Nothing complicated here… but you need to be careful as to which file you should look at. You have two possibilities here:

Scenario 1 – You are cloning an existing slave

That is, you are adding a new slave. Let's call the host you previously backed the data from Slave-0. Slave-0 itself was already configured as a slave of another MySQL server instance, its master, that we'll call …Master. Right now you basically want to "clone" Slave-0; that is, you want to add a new slave to Master. Let's call this new slave Slave-1. So, to recap: both Slave-0 AND the new Slave-1 will in the end be replicating from the same Master, while there won't be any link between Slave-0 and Slave-1. Hope this is clear! 🙂

If this is what you are trying to do here, the file you need to look at is xtrabackup_slave_info. If you look at the content of this file,

cat /var/lib/mysql/xtrabackup_slave_info
CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000834', MASTER_LOG_POS=50743116

you’ll see that it already has a “CHANGE MASTER..” command ready for you to use. So open a MySQL console, and just execute that command:

mysql> stop slave; # optional: only needed if this instance was already a slave and MySQL is configured without the skip-slave-start option
mysql> CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000834',
MASTER_LOG_POS=50743116;
mysql> START SLAVE;
mysql> SHOW SLAVE STATUS \G

If everything was OK, you’ll see something like this:

...
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
...
Seconds_Behind_Master: 1643
...

Seconds_Behind_Master is an estimation (not too accurate, I must say) of the time in seconds it will take for this new slave to catch up with the master. How long it takes depends on how long ago the backup you've used to build this slave was taken from the existing slave you cloned, and on how much data has changed on the master in the meantime.

Update Tue, May 17 2011: As reader Mora points out in the comments, this definition of Seconds_Behind_Master is not correct.

Seconds_Behind_Master is how many seconds the slave is behind the master. It is not an estimation of how long it's going to take to catch up: if the slave has been stopped for 1 hour, it will be 3600 seconds behind, but it can take anywhere from a few seconds to a few days (or more) to catch up.

Thanks to Mora for the clarification.

Note: you may have noticed that the file containing the information about the master, only contains the name and position of the master’s binary log file in use at the moment the backup was taken; there is no username, or password. This is because credentials are not saved there for obvious reasons. So unless you have added values for MASTER_HOST, MASTER_USER and MASTER_PASSWORD to your my.cnf file, you may need to add these settings to the “CHANGE MASTER..” command above.

Scenario 2 – You are setting up a new slave, or resyncing an existing slave having as master the MySQL instance you've backed data from in the previous steps

In this second scenario, instead, say that

  • you want to use the MySQL server instance you previously backed your data from, as master in a typical replication; let's call this instance Master;
  • you want to use the new instance you are currently setting up from that backup, as Slave. Or, the master-slave replication between these two already existed but you need to resync this slave.

If this is what you are trying to do, the file you need to look at is xtrabackup_binlog_info instead. If you look at the content of this file,

cat /var/lib/mysql/xtrabackup_binlog_info
mysql-bin.000228 56875717

you’ll see something slightly different to what the other file contains. The first value is the name of the binary log file on the master, which you need to attach this slave to; the second value is instead the position within that binary log from which MySQL has to start catching up with changes from the master. Optionally, you may also see a third value being the list of names of the databases excluded from the replication.

Now you can use this information to update the link with the master. Start a MySQL console and enter the following commands to do this, then start the slave:

mysql> CHANGE MASTER TO MASTER_HOST='...', MASTER_USER='...',
MASTER_PASSWORD='...', MASTER_LOG_FILE='mysql-bin.000228',
MASTER_LOG_POS=56875717;
mysql> START SLAVE;
mysql> SHOW SLAVE STATUS \G

Also in this case, you’ll have to enter the username and password you are using for the replication process, and be careful to enter the log file name and position correctly, according to your xtrabackup_binlog_info file.

Finally, if all worked as expected, you should see:

...
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
...
Seconds_Behind_Master: 2184
...

Cool, you’re done.

I am pretty sure you’ll see right away – as soon as you try XtraBackup – the huge benefits of this alternative vs mysqldump. Make sure, though, you test backups and restores enough before using this system in production!

Know of any other tips related to XtraBackup, or of a similar or even better alternative (still free)? Please share them in the comments!