has_many :codes

Recovering from MySQL replication stops caused by failing statements

Published  

Sometimes it happens that the MySQL replication stops because some statements received from the master fail when replayed on a slave. In such cases, depending on the nature of the failure you may need to rebuild the slave from scratch using a backup from the master (I’d recommend Percona’s Xtrabackup for this), or try to re-sync the slave using something like pt-table-sync again from Percona (these tools were originally known as Maatkit).

Luckily, however, this isn’t always the case as sometimes the failing statements don’t affect the data or might only affect data that isn’t important or is removed frequently (for example tables containing temporary data); in such cases then you may not need to rebuild the slave or re-sync with the Percona tools (which perform a time consuming comparison operation -especially on large tables- and might cause high load on both master and slave, up to causing locks and various issues depending on the usage of the servers at the time and the resources available); you can instead get the slave up and running again very quickly by simply skipping the failing statements, provided that these statements can be safely ignored as well as any data they might affect.

This can be done by temporarily stopping the slave, then instructing MySQL to skip one or more statements received from the master, and eventually starting the slave again. For example, let’s assume that we find out that the replication is temporarily stopped due to some statement causing errors:

mysql> show slave status \G
*************************** 1. row ***************************
...
Slave_IO_Running: Yes
Slave_SQL_Running: No
...
Last_Errno: (some error code)
Last_Error: (description)
...
Seconds_Behind_Master: NULL
...
Last_SQL_Errno: (some error code)
Last_SQL_Error: Error ... (description)
...
1 row in set (0.00 sec)

Note the Slave_SQL_Running status, the error details, and Seconds_Behind_Master.

To fix you’d stop the slave now, and tell MySQL to skip at least the next statement (which we know for sure that it is failing) by setting the global variable SQL_SLAVE_SKIP_COUNTER. You’d then start the slave again:

mysql> stop slave;
Query OK, 0 rows affected (0.10 sec)

mysql> set global SQL_SLAVE_SKIP_COUNTER = 1;
Query OK, 0 rows affected (0.00 sec)

mysql> start slave;
Query OK, 0 rows affected (0.00 sec)

Assuming, as in the example, that only one statement is failing, by skipping that single statement the slave can happily start to replicate again:

mysql> show slave status \G
*************************** 1. row ***************************
...
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
...
Last_Errno: 0
Last_Error:
...
Seconds_Behind_Master: 5341
...
Last_SQL_Errno: 0
Last_SQL_Error:
...
1 row in set (0.00 sec)

Done this, you’d only need to wait for the slave to catch up with the master, and how long that’ll take depends on how long the replication has been in a stopped state.

If only one statement were failing, you’d be lucky. At times there can be many more failing statements really depending on the replicated databases and their usage. So in these cases you could still skip one statement per time until the replication is fine again, but you wouldn’t have any way to know how many of the next statements received from the master will eventually fail. Another option is to set SQL_SLAVE_SKIP_COUNTER to a value greater than 1, so that MySQL will skip multiple statements at once, but since you don’t know how many statements will fail, there’s the risk that some other statements that otherwise wouldn’t fail will also be missed by the slave affecting some important data.

You can avoid this by still skipping one failing statement per time, but rather than doing this manually you can automate this with very simple shell commands. For starters, let’s save somehow the credentials to connect to the MySQL instance, e.g. in some variables:

MYSQL_USER=...
MYSQL_PASSWORD=...

Next, let’s save with an alias a simple command line to find out whether the slave is currently running:

alias slave-check='mysql -u$MYSQL_USER -p$MYSQL_PASSWORD -e "show slave status \G" | \
grep "Slave_SQL_Running: Yes" | \
wc -l'

So now we can just run the command slave-check and that’ll return either 1 or 0 depending on whether the replication is running or not. Similarly, we can save with another alias a command line that tells MySQL to stop the slave, skip one statement and restart the slave:

alias skip-one='mysql -u$MYSQL_USER -p$MYSQL_PASSWORD \
-e "set global SQL_SLAVE_SKIP_COUNTER = 1; \
stop slave; \
start slave; \
select sleep(1);"'

So we can just run skip-one to automate this in a very simple way. Note that I’ve added a select sleep(1) query too since sometimes the slave might appear as running for a brief time after skipping one statement, although the slave is stopped again soon after due to other failing statements.

Finally, we can save a third command line that keeps skipping failing statements one per time until the replication is running again:

alias skip-failing-statements="while [ `slave_check` = 0 ]; do skip-one; done"

If you save these simple commands in your shell profile for example, you can recover a stopped slave (when possible) by simply running skip-failing-statements, That’s it. It’s a stupid simple tip but it might be useful to someone who is new to MySQL replication or isn’t too familiar with the command line.

© Vito Botta