share_counts Ruby gem: The easiest way to check how many times a URL has been shared on social networks!

I was looking for a way to check in a single place how many times a URL has been shared on the most popular social networks and aggregators, but I couldn’t find one. So I wrote some code to query these networks’ APIs and, thinking it might be useful to others, packaged it up as a gem. In fact, I already have confirmation that it is useful to others: I published the gem only hours ago, and even though I am only writing about it now, almost 20 people have already downloaded it! At the moment the gem – named share_counts – supports the following social networks and aggregators:

  • Reddit
  • Digg
  • Twitter
  • Facebook (Shares and Likes)
  • LinkedIn
  • Google Buzz
  • StumbleUpon

I may add support for other networks if needed, and I will most likely extend the gem with other methods that leverage more of what these APIs offer, so stay tuned.

Using the gem

Github repo: https://github.com/vitobotta/share_counts.
On RubyGems: https://rubygems.org/gems/share_counts/stats.

Once you have installed the gem with the usual gem install share_counts, it’s very easy to use. For example, if you want to check the Reddit score for a story, you can call the method reddit with the URL as argument:

ruby-1.9.2-p0 :001 > require "share_counts"
=> true

ruby-1.9.2-p0 :016 > ShareCounts.supported_networks
=> ["reddit", "digg", "twitter", "facebook", "fblike", "linkedin", "googlebuzz", "stumbleupon"]

ruby-1.9.2-p0 :002 > ShareCounts.reddit "https://vitobotta.com/awesomeprint-similar-production/"
Redis caching is disabled - Making request to reddit...
=> 5

The other supported networks work the same way. Facebook alone has two methods rather than one, since it tracks both “shares” and “likes”:

ruby-1.9.2-p0 :003 > ShareCounts.facebook "https://vitobotta.com/awesomeprint-similar-production/"
Redis caching is disabled - Making request to facebook...
=> 1

ruby-1.9.2-p0 :004 > ShareCounts.fblike "https://vitobotta.com/awesomeprint-similar-production/"
Redis caching is disabled - Making request to fblike...
=> 0

You can also get both shares and likes together:

ruby-1.9.2-p0 :007 > ShareCounts.fball "https://vitobotta.com/awesomeprint-similar-production/"
Redis caching is disabled - Making request to fball...
=> {"share_count"=>1, "like_count"=>0}

You can also get the share counts for all the supported services in a single call, or specify only the ones you are interested in:

ruby-1.9.2-p0 :005 > ShareCounts.all "https://vitobotta.com/awesomeprint-similar-production/"
Redis caching is disabled - Making request to reddit...
Redis caching is disabled - Making request to digg...
Redis caching is disabled - Making request to twitter...
Redis caching is disabled - Making request to facebook...
Redis caching is disabled - Making request to fblike...
Redis caching is disabled - Making request to linkedin...
Redis caching is disabled - Making request to googlebuzz...
Redis caching is disabled - Making request to stumbleupon...
=> {:reddit=>4, :digg=>1, :twitter=>2, :facebook=>1, :fblike=>0, :linkedin=>2, :googlebuzz=>0, :stumbleupon=>0}

ruby-1.9.2-p0 :006 > ShareCounts.selected "https://vitobotta.com/awesomeprint-similar-production/", [ :reddit, :linkedin ]
Redis caching is disabled - Making request to reddit...
Redis caching is disabled - Making request to linkedin...
=> {:reddit=>4, :linkedin=>2}

In these cases you’ll get back a hash instead.

At this point you may have noticed the message “Redis caching is disabled” printed with each call: caching was indeed disabled in these examples. Since a) some of these social networks’ APIs aren’t available or working 100% of the time, and b) some of them may apply rate limiting if you make too many requests in a short period, the gem also supports caching with Redis.

Caching is disabled by default, since you may not be running Redis, you may prefer some other caching in your application, or you may not want caching at all. So if you do want caching, the first step is to enable it. By default share_counts assumes Redis is listening on 127.0.0.1:6379, but you can override this either by setting the global variable $share_counts_cache in advance, if you already have a reference to a Redis connection (and you’re using the same redis gem share_counts uses), or by passing that reference as an argument. Alternatively, you can specify host and port when you enable caching:

# Using caching with the default settings
ruby-1.9.2-p0 :009 > ShareCounts.use_cache
=> #<Redis client v2.1.1 connected to redis://127.0.0.1:6379/0 (Redis v2.0.3)>

# Using an existing reference to a connection to Redis
$share_counts_cache = a_Redis_connection

# Same thing as above, but by passing the reference to the connection as argument
ruby-1.9.2-p0 :010 > ShareCounts.use_cache :redis_store => a_Redis_connection

# Specifying host and port for the connection to Redis
ruby-1.9.2-p0 :010 > ShareCounts.use_cache :host => "localhost", :port => 6379
=> #<Redis client v2.1.1 connected to redis://127.0.0.1:6379/0 (Redis v2.0.3)>

Cached share counts expire by default in 2 minutes, but you can again override this by setting the global variable $share_counts_cache_expire to a value in seconds.
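For example, to keep cached counts for ten minutes rather than the default two (the variable name comes straight from the gem; the value is just an illustration):

# Cache share counts for 10 minutes instead of the default 2 minutes
$share_counts_cache_expire = 600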

So, let’s compare now using the gem with and without caching:

ruby-1.9.2-p0 :002 > require 'benchmark'
=> true

# Enabling caching
ruby-1.9.2-p0 :003 > ShareCounts.use_cache
=> #<Redis client v2.1.1 connected to redis://127.0.0.1:6379/0 (Redis v2.0.3)>

# First run, values are not cached
ruby-1.9.2-p0 :004 > Benchmark.realtime { ShareCounts.all "https://vitobotta.com/awesomeprint-similar-production/" }
Making request to reddit...
Making request to digg...
Making request to twitter...
Making request to facebook...
Making request to fblike...
Making request to linkedin...
Making request to googlebuzz...
Making request to stumbleupon...
=> 3.7037899494171143

# Now values are cached
ruby-1.9.2-p0 :005 > Benchmark.realtime { ShareCounts.all "https://vitobotta.com/awesomeprint-similar-production/" }
Loaded reddit count from cache
Loaded digg count from cache
Loaded twitter count from cache
Loaded facebook count from cache
Loaded fblike count from cache
Loaded linkedin count from cache
Loaded googlebuzz count from cache
Loaded stumbleupon count from cache
=> 0.003225088119506836

You can see which URLs have been cached, together with their available share counts, using the cached method:

ruby-1.9.2-p0 :013 > ShareCounts.cached
=> {"https://vitobotta.com/awesomeprint-similar-production/"=>{:fblike=>0, :stumbleupon=>0, :linkedin=>2, :googlebuzz=>0, :facebook=>1, :twitter=>2, :digg=>1, :reddit=>5}}

And if you need to, you can clear the cached values:

ruby-1.9.2-p0 :013 > ShareCounts.cached
=> {"https://vitobotta.com/awesomeprint-similar-production/"=>{:fblike=>0, :stumbleupon=>0, :linkedin=>2, :googlebuzz=>0, :facebook=>1, :twitter=>2, :digg=>1, :reddit=>5}}
ruby-1.9.2-p0 :014 > ShareCounts.clear_cache
=> ["ShareCounts||fblike||https://vitobotta.com/awesomeprint-similar-production/", "ShareCounts||stumbleupon||https://vitobotta.com/awesomeprint-similar-production/", "ShareCounts||linkedin||https://vitobotta.com/awesomeprint-similar-production/", "ShareCounts||googlebuzz||https://vitobotta.com/awesomeprint-similar-production/", "ShareCounts||facebook||https://vitobotta.com/awesomeprint-similar-production/", "ShareCounts||twitter||https://vitobotta.com/awesomeprint-similar-production/", "ShareCounts||digg||https://vitobotta.com/awesomeprint-similar-production/", "ShareCounts||reddit||https://vitobotta.com/awesomeprint-similar-production/"]
ruby-1.9.2-p0 :015 > ShareCounts.cached
=> {}

Notes:

  • If a request fails for one network, its share count won’t be cached and will remain nil. This way you can easily tell whether a service’s API failed simply by checking whether its share count for the given URL is nil (a quick check is sketched after these notes).
  • Since you may be already using Redis in your app for something else, the gem namespaces the keys so that if you clear the cache, only its keys will be deleted.
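For example, a quick way to spot services whose API call failed (a minimal sketch; the select/map calls are plain Ruby, nothing gem-specific):

counts = ShareCounts.all "https://vitobotta.com/awesomeprint-similar-production/"

# Any nil value means the request to that service failed and was not cached
failed = counts.select { |network, count| count.nil? }.map { |network, _| network }
puts "Failed services: #{failed.inspect}" unless failed.empty?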

A look at the code

The code is on Github if you want to have a look. Here I’ll highlight a few things.

All the methods to retrieve share counts for each supported service are wrapped in the module ShareCounts, as you may already have guessed:

module ShareCounts

  extend Common
  extend Caching

  def self.supported_networks
    %w(reddit digg twitter facebook fblike linkedin googlebuzz stumbleupon)
  end

  def self.reddit url
    try("reddit", url) {
      extract_count from_json( "http://www.reddit.com/api/info.json", :url => url ),
        :selector => "data/children/data/score"
    }
  end

  ...

In particular, the try method will try to fetch the requested share count(s) by either making an HTTP request (or multiple requests depending on which share counts are being requested) or, if caching is enabled, from the cache.

def try service, url, &block
  cache_key = "ShareCounts||#{service}||#{url}"
  if cache_enabled?
    if result = from_redis(cache_key)
      puts "Loaded #{service} count from cache"
      result
    else
      puts "Making request to #{service}..."
      to_redis(cache_key, yield)
    end
  else
    puts "Redis caching is disabled - Making request to #{service}..."
    yield
  end
rescue Exception => e
  puts "Something went wrong with #{service}: #{e}"
end

Since most of these APIs follow a common pattern, HTTP requests are made on the assumption that each API returns a JSON response, with or without a callback method; if a callback is used, the response is first manipulated to extract just the JSON data we need. The make_request method attempts a request to a network’s API up to three times, with a timeout of two seconds per attempt. There’s a reason for this: while testing these APIs I noticed that, in most cases, if a request didn’t return within a couple of seconds it either timed out after a long while or came back with a 503 Service Unavailable status code. From this point of view, I must say I was surprised to see that Digg‘s API was probably the least reliable of the bunch, returning a 503 far too often, even though I wasn’t making many requests in a short period of time, so I doubt this was due to rate limiting. In any case, the combination of a two-second timeout and three attempts means we expect a response from each service within a few seconds, which is a good compromise if you use caching. To make requests I am using one of my favourite gems, rest-client (from the Github user archiloque‘s fork, since it seems to be more up to date than the original one by Heroku‘s Adam Wiggins):

def make_request *args
  result = nil
  attempts = 1

  begin
    timeout(2) do
      url = args.shift
      params = args.inject({}) { |r, c| r.merge! c }
      response = RestClient.get url, { :params => params }

      # if a callback is specified, the expected response is in the format "callback_name(JSON data)";
      # with the response ending with ";" and, in some cases, "\n"
      result = params.keys.include?(:callback) \
        ? response.gsub(/^(.*);+\n*$/, "\\1").gsub(/^#{params[:callback]}\((.*)\)$/, "\\1") \
        : response
    end

  rescue Exception => e
    puts "Failed #{attempts} attempt(s)"
    attempts += 1
    retry if attempts <= 3
  end

  result
end

As for extracting the actual share counts from each API’s response, I was pleased to see a common pattern in the use of JSON, so it was as easy as writing a simple method that “queries” the JSON data in a way that somewhat recalls XPath for XML. The arguments are the JSON data and a single-key hash of the form :selector => “where/the/share/count/is”:

def extract_count *args
  json = args.shift
  result = args.first.flatten.last.split("/").inject( json.is_a?(Array) ? json.first : json ) {
    |r, c| r[c].is_a?(Array) ? r[c].first : r[c]
  }
end
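To make the traversal clearer, here is the same walk performed by hand on a heavily trimmed, made-up Reddit-style response (the sample hash is only for illustration):

# A stripped-down example of the kind of JSON reddit's info.json returns (illustrative only)
json = {
  "data" => {
    "children" => [
      { "data" => { "score" => 5 } }
    ]
  }
}

selector = "data/children/data/score"

# Same logic as extract_count: descend one key at a time,
# taking the first element whenever an Array is encountered
selector.split("/").inject(json) { |r, c| r[c].is_a?(Array) ? r[c].first : r[c] }
# => 5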

The stuff needed for the caching with Redis is in a separate mix-in. If you haven’t used Redis yet, you can see its most basic usage from looking at this code.

To initialise a connection and optionally specify host and port:

def use_cache *args
  arguments = args.inject({}) { |r, c| r.merge(c) }
  $share_counts_cache ||= arguments[:redis_store] ||
    Redis.new(:host => arguments[:host] || "127.0.0.1", :port => arguments[:port] || "6379")
end

To read from and write to Redis:

def from_redis(cache_key)
  value = $share_counts_cache.get(cache_key)
  return if value.nil?
  Marshal.load value
end

def to_redis(cache_key, value)
  $share_counts_cache.set cache_key, Marshal.dump(value)
  $share_counts_cache.expire cache_key, $share_counts_cache_expire || 120
  value
end

Then we have methods to return all the cached values used by the gem, and to clear those cached values:

def cached
  urls = ($share_counts_cache || {}).keys.select{ |k| k =~ /^ShareCounts/ }.inject({}) do |result, key|
    data = key.split("||"); network = data[1]; url = data[2];
    count = from_redis("ShareCounts||#{network}||#{url}")
    (result[url] ||= {})[network.to_sym] = count unless ["all", "fball"].include? network
    result
  end
  urls
end

def clear_cache
  ($share_counts_cache || {}).keys.select{ |cache_key| cache_key =~ /^ShareCounts/ }.each{ |cache_key|
    $share_counts_cache.del cache_key }
end

As you can see, keys are “namespaced”, and I am using inject to build a hash of the cached URLs and their share counts.

APIs: a few exceptions to the “rule”

As said, most of the supported social networks’ APIs follow a common pattern in their use of JSON data. However, there were two exceptions. The first is a small one, with Google Buzz: its API returns a JavaScript object -instead of an array- whose only property is the URL passed as argument, and the value of that property is the actual share count on Google Buzz. So in this case, rather than using the extract_count method as with the other JSON-based APIs, all I had to do was read the value of that property once the JSON response was parsed:

def self.googlebuzz url
  try("googlebuzz", url) {
    from_json("http://www.google.com/buzz/api/buzzThis/buzzCounter",
      :url => url, :callback => "google_buzz_set_count" )[url]
  }
end

The second exception is StumbleUpon. I was surprised and disappointed to see that they don’t have an API yet (unless I missed it); it looks like StumbleUpon is a little behind the competition on this front. Luckily, despite the lack of an API, it wasn’t much more difficult to fetch share counts for SU too: once I had identified the HTML returned when their button is displayed, I could use Nokogiri to extract the share count with an XPath expression:

def self.stumbleupon url
  try("stumbleupon", url) {
    Nokogiri::HTML.parse(
      make_request("http://www.stumbleupon.com/badge/embed/5/", :url => url )
    ).xpath( "//body/div/ul/li[2]/a/span" ).text.to_i
  }
end

So this was a quick look at the code as it is now, but I expect to add more methods to fetch more information from the APIs, so keep an eye on the Github repo if you plan on using this gem.

Also, if you have any suggestions on what to add or how to improve it, please let me know in the comments.

Why you should think twice before using awesome_print (and similar) in production

In the previous posts I showed several examples of how to prettify Ruby objects in a console or in an application’s logs, with the purpose of making this information more readable, more user friendly and therefore more useful, especially when debugging. I also showed how, by overriding an object’s inspect method, it is possible to obtain information about the object that is not only displayed in a prettier, more colourful way, but is also more complete than what the default “printers” usually return. I suggest you read those posts first – unless you already have – to get some context.

In particular, we’ve also seen how easy it is to do all this by using gems such as awesome_print, rather than having to write custom inspect methods.

At this point, whether you already knew awesome_print (and similar gems) or not, you may be tempted to use it to improve your application logs in a production environment too – unless you are already doing so! Normal production logs, in fact, can be pretty useless, or at least difficult to read, when investigating unknown problems, so using awesome_print and similar methods in production as well may sound like a good idea.

However, if you have already read the previous posts, you may also remember my warning to be careful about when and how you use awesome_print. The advice applies particularly to awesome_print, but it may also apply to other Ruby gems that work in a similar way. The reason lies in how these gems -in most cases- access Ruby objects and their properties in order to print the relevant information.

This may surprise some, but for example awesome_print can be a real performance killer if misused in a production environment! Let’s have a look at an example with a super simple Sinatra application -a single file, test.rb- to illustrate the issue:

require "rubygems"
require "active_record"
require "sinatra"
require "sinatra/activerecord"

class PageStatus < ActiveRecord::Base
establish_connection :host => "localhost",
:adapter => "mysql",
:user => "root",
:database => "blog_engine_development"
end

get "/test" do
p PageStatus.all
PageStatus.count.to_s
end

So here we have a very simple Sinatra application with an ActiveRecord model, and a single route that first logs all the records to stdout (through the default inspect method thanks to the shorthand p) – just to simulate some real debugging information being logged by an application – and then returns the records count. Let’s check quickly that it works, by starting the Sinatra application, then issuing a request with wget, to ensure that the output in stdout is what we expect (the array of records):

# starting the Sinatra application
510 $ ruby test.rb
== Sinatra/1.1.2 has taken the stage on 4567 for development with backup from Thin
>> Thin web server (v1.2.7 codename No Hup)
>> Maximum connections set to 1024
>> Listening on 0.0.0.0:4567, CTRL+C to stop

# issuing a request with wget - see the records count returned
511 $ wget -qO- http://0.0.0.0:4567/test
4

# back to the application output
== Sinatra/1.1.2 has taken the stage on 4567 for development with backup from Thin
>> Thin web server (v1.2.7 codename No Hup)
>> Maximum connections set to 1024
>> Listening on 0.0.0.0:4567, CTRL+C to stop
D, [2011-01-23T22:00:20.973396 #21308] DEBUG -- : PageStatus Load (0.3ms) SELECT `page_statuses`.* FROM `page_statuses`
[#<PageStatus id: 1, name: "Draft">, #<PageStatus id: 2, name: "Published">, #<PageStatus id: 3, name: "Scheduled">, #<PageStatus id: 4, name: "Deleted">]
D, [2011-01-23T22:00:20.976202 #21308] DEBUG -- : SQL (0.2ms) SELECT COUNT(*) FROM `page_statuses`
127.0.0.1 - - [23/Jan/2011 22:00:20] "GET /test HTTP/1.0" 200 1 0.0786

So, as expected, we can see the output generated by p. We are not actually storing this debugging information in a log file here, but bear with me for a moment – you’ll see it doesn’t really matter. Next, let’s see with httperf roughly how many requests per second can be processed. For the purpose of this test we don’t need precise figures or scientifically rigorous measurements; a rough indication of the application’s performance will be more than enough:

604 $ httperf --server 0.0.0.0 --port 4567 --num-conns 1000 --num-calls=10 --uri /test
httperf --client=0/1 --server=0.0.0.0 --port=4567 --uri=/test --send-buffer=4096 --recv-buffer=16384 --num-conns=1000 --num-calls=10
httperf: warning: open file limit > FD_SETSIZE; limiting max. # of open files to FD_SETSIZE
Maximum connect burst length: 1

Total: connections 1000 requests 10000 replies 10000 test-duration 23.017 s

Connection rate: 43.4 conn/s (23.0 ms/conn, <=1 concurrent connections)
Connection time [ms]: min 16.6 avg 23.0 max 154.7 median 19.5 stddev 9.8
Connection time [ms]: connect 0.3
Connection length [replies/conn]: 10.000

Request rate: 434.5 req/s (2.3 ms/req)
Request size [B]: 64.0

Reply rate [replies/s]: min 426.8 avg 431.9 max 443.2 stddev 7.6 (4 samples)
Reply time [ms]: response 2.3 transfer 0.0
Reply size [B]: header 137.0 content 1.0 footer 0.0 (total 138.0)
Reply status: 1xx=0 2xx=10000 3xx=0 4xx=0 5xx=0

CPU time [s]: user 4.50 system 18.45 (user 19.5% system 80.1% total 99.7%)
Net I/O: 85.7 KB/s (0.7*10^6 bps)

Errors: total 0 client-timo 0 socket-timo 0 connrefused 0 connreset 0
Errors: fd-unavail 0 addrunavail 0 ftab-full 0 other 0

I got around 434 req/sec. Update: as suggested by reader nicholaides, I have also included a test without any logging for comparison, to see what sort of impact normal logging has on performance versus no logging at all. I simply removed the call to the p method in the test above, leaving everything else as it was:

...
get "/test" do
PageStatus.all
PageStatus.count.to_s
end

The result:

504 $ httperf --client=0/1 --server=0.0.0.0 --port=4567 --uri=/test --send-buffer=4096 --recv-buffer=16384 --num-conns=1000 --num-calls=10
httperf --client=0/1 --server=0.0.0.0 --port=4567 --uri=/test --send-buffer=4096 --recv-buffer=16384 --num-conns=1000 --num-calls=10
httperf: warning: open file limit > FD_SETSIZE; limiting max. # of open files to FD_SETSIZE
Maximum connect burst length: 1

Total: connections 1000 requests 10000 replies 10000 test-duration 19.966 s

Connection rate: 50.1 conn/s (20.0 ms/conn, <=1 concurrent connections)
Connection time [ms]: min 14.3 avg 20.0 max 59.9 median 16.5 stddev 8.6
Connection time [ms]: connect 0.2
Connection length [replies/conn]: 10.000

Request rate: 500.9 req/s (2.0 ms/req)
Request size [B]: 64.0

Reply rate [replies/s]: min 493.8 avg 505.2 max 519.4 stddev 13.0 (3 samples)
Reply time [ms]: response 2.0 transfer 0.0
Reply size [B]: header 137.0 content 1.0 footer 0.0 (total 138.0)
Reply status: 1xx=0 2xx=10000 3xx=0 4xx=0 5xx=0

CPU time [s]: user 3.90 system 16.02 (user 19.5% system 80.2% total 99.8%)
Net I/O: 98.8 KB/s (0.8*10^6 bps)

Errors: total 0 client-timo 0 socket-timo 0 connrefused 0 connreset 0
Errors: fd-unavail 0 addrunavail 0 ftab-full 0 other 0

So this tiny sample application performs about 15% better without logging than with it. Again, this is just an example to give an idea – figures may vary greatly depending on the application and many other factors – but the point is that an application without logging is faster than the same application with logging, which should be expected anyway.

Next, let’s change the code slightly so to use awesome_print instead of the usual inspect method (notice the call to ap instead of p):

require "rubygems"
require "active_record"
require "sinatra"
require "sinatra/activerecord"
require "awesome_print"

class PageStatus < ActiveRecord::Base
establish_connection :host => "localhost",
:adapter => "mysql",
:user => "root",
:database => "blog_engine_development"
end

get "/test" do
ap PageStatus.all
PageStatus.count.to_s
end

With wget, we can verify that, as expected, the debugging information we are sending to stdout is now a lot prettier and more readable thanks to awesome_print:

537 $ ruby test.rb
== Sinatra/1.1.2 has taken the stage on 4567 for development with backup from Thin
>> Thin web server (v1.2.7 codename No Hup)
>> Maximum connections set to 1024
>> Listening on 0.0.0.0:4567, CTRL+C to stop
D, [2011-01-23T22:11:14.943581 #21337] DEBUG -- : PageStatus Load (0.2ms) SELECT `page_statuses`.* FROM `page_statuses`
[
    [0] #<PageStatus:0x000001023fb850> {
          :id => 1,
        :name => "Draft"
    },
    [1] #<PageStatus:0x000001023fb710> {
          :id => 2,
        :name => "Published"
    },
    [2] #<PageStatus:0x000001023fb5d0> {
          :id => 3,
        :name => "Scheduled"
    },
    [3] #<PageStatus:0x000001023fb490> {
          :id => 4,
        :name => "Deleted"
    }
]
D, [2011-01-23T22:11:14.958228 #21337] DEBUG -- : SQL (0.3ms) SELECT COUNT(*) FROM `page_statuses`
127.0.0.1 - - [23/Jan/2011 22:11:14] "GET /test HTTP/1.0" 200 1 0.0965

But what about performance? Let’s repeat the same httperf test with the awesome_print gem enabled:

601 $ httperf --server 0.0.0.0 --port 4567 --num-conns 1000 --num-calls=10 --uri /test
httperf --client=0/1 --server=0.0.0.0 --port=4567 --uri=/test --send-buffer=4096 --recv-buffer=16384 --num-conns=1000 --num-calls=10
httperf: warning: open file limit > FD_SETSIZE; limiting max. # of open files to FD_SETSIZE
Maximum connect burst length: 1

Total: connections 1000 requests 10000 replies 10000 test-duration 31.787 s

Connection rate: 31.5 conn/s (31.8 ms/conn, <=1 concurrent connections)
Connection time [ms]: min 23.0 avg 31.8 max 158.3 median 27.5 stddev 11.4
Connection time [ms]: connect 0.3
Connection length [replies/conn]: 10.000

Request rate: 314.6 req/s (3.2 ms/req)
Request size [B]: 64.0

Reply rate [replies/s]: min 301.6 avg 314.4 max 327.4 stddev 9.9 (6 samples)
Reply time [ms]: response 3.2 transfer 0.0
Reply size [B]: header 137.0 content 1.0 footer 0.0 (total 138.0)
Reply status: 1xx=0 2xx=10000 3xx=0 4xx=0 5xx=0

CPU time [s]: user 6.18 system 25.52 (user 19.4% system 80.3% total 99.7%)
Net I/O: 62.1 KB/s (0.5*10^6 bps)

Errors: total 0 client-timo 0 socket-timo 0 connrefused 0 connreset 0
Errors: fd-unavail 0 addrunavail 0 ftab-full 0 other 0

Wow, the application’s performance dropped by almost 30%, just by using awesome_print once instead of p! And the difference is about -37% compared with no logging at all. Think of a real world application, where you may want to log many objects and much more data, and imagine what sort of impact on performance it would have if you used awesome_print to prettify your logs!

Same test, but with a small change to the code so that we log two objects in the same route (shown here with p; the awesome_print version simply swaps each p for ap):

...
get "/test" do
  p PageStatus.all
  p PageStatus.first
  PageStatus.count.to_s
end
...

With inspect (p), tests show figures around 350 req/sec, while I got around 250 req/sec with awesome_print – again roughly a 30% performance drop; then, with three calls (I simply added “p PageStatus.last“), I got around 270 req/sec with p and around 190 req/sec with ap. Yet again, the difference is around 30%.
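For reference, the three-call variant mentioned above looks like this (swap each p for ap to reproduce the awesome_print figures):

...
get "/test" do
  p PageStatus.all
  p PageStatus.first
  p PageStatus.last
  PageStatus.count.to_s
end
...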

What do these figures show? Essentially two things:

a) however much prettier it makes application logs, it’s not a good idea to use awesome_print (or similar) in a production environment, as it clearly affects performance in a significant way;

b) regardless of which method(s) your application uses to log debugging information, the more information it has to manipulate and prettify before logging with separate p/ap (etc.) calls, the slower the application becomes. Also, as shown by the test I added later to the post, even a minimal amount of logging has a measurable impact on performance compared with no logging at all.

The second point should be fairly obvious, but my experience with other developers leads me to think that most people tend to overlook how much heavy logging can hurt performance, and that most people associate logging’s performance cost only with writing to a file, because of the added I/O overhead. In truth, as these simple tests show, logging affects performance regardless of whether the information is stored in a file or not.

I usually prefer having as much debugging information as possible in my application logs, especially for new, large and important applications for which even a small unexpected or unidentified problem can produce significant loss in revenue for my client or employer, if it is not fixed right away. I recently built (in Sinatra) a pretty large and complex API (by Sinatra standards) which is responsible for the core operations of my former employer’s business in the online lead generation market; these operations – in particular: cleansing, validation and delivery of data – are made of several processes and elements; without a good amount of logging, it would have been very hard, in some cases, to figure out the source of a problem.

To address the performance penalty while still collecting a good amount of debugging information, I have started work on a component that will collect this information in a different way from other loggers, with a lower impact on performance. It is at a very early stage, however (know of something similar and ready to use? please let me know!), so as a general rule, following these simple tests, my advice is to keep logging to a reasonable amount and to always keep the above performance considerations in mind when setting up applications in production, especially if you still wish to prettify your logs with awesome_print or similar methods.
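As a small, general mitigation in the meantime – this is not the component mentioned above, just a sketch using Ruby’s standard Logger – the block form of the logging methods only builds the message when the current log level allows it, so any expensive prettifying is skipped entirely when the level is higher:

require "logger"

logger = Logger.new($stdout)
logger.level = Logger::INFO

records = (1..1_000).map { |i| { :id => i, :name => "record #{i}" } }

# The block is only called if DEBUG messages are enabled, so the
# (potentially expensive) formatting below never runs at INFO level.
logger.debug { records.inspect }

# A plain argument, by contrast, is always evaluated,
# even though nothing ends up in the log at this level.
logger.debug(records.inspect)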

A few more methods to format and beautify Ruby objects in console and logs

This is a sort of quick follow up to the previous post. Since then, I’ve seen a few other tricks that can be used to improve the display of output in a Ruby console, as well as in application logs, to make that information more useful for debugging or, also, for display in a dashboard.

The reason I have been looking for this kind of trick is that I am now working on a component that should make it easier to investigate problems with applications in a production environment; this component (at a very early stage) works by collecting, parsing and interpreting the information in the application logs in a different way from how this is usually done. One of the goals, however, is also to make it possible to collect as much information as may be required without affecting the application’s performance too much.

This is because, unfortunately, as I mentioned in the previous post, manipulating what is sent to stdout or to a log file in order to prettify it usually comes with significant performance drawbacks that should be taken seriously into consideration in a production environment.

We’ll see what sort of performance penalty I am referring to – with a few examples – in the next post. In any case, it can still be useful to prettify information that helps with debugging, at least in development. So, here are a few more ways of formatting Ruby objects in a console and in logs.

Pretty print of JSON

The JSON module, available by requiring the standard library of the same name, exposes a method named pretty_generate that generates the JSON for the object passed as argument with indentation and new lines, making it easier to read:

ruby-1.9.2-p0 > require "json"
=> true

ruby-1.9.2-p0 > json = { :a => 1, :b => [1,2,3], :c => { :d => "e" } }
=> {:a=>1, :b=>[1, 2, 3], :c=>{:d=>"e"}}

ruby-1.9.2-p0 > puts JSON.pretty_generate json
{
  "a": 1,
  "b": [
    1,
    2,
    3
  ],
  "c": {
    "d": "e"
  }
}
=> nil

Pretty print of XML

It’s easy to do something similar with any XML text, by using the nokogiri gem as shown below:

require "nokogiri"

def xp(xml_text)
  xsl =<<XSL
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="/">
    <xsl:copy-of select="."/>
  </xsl:template>
</xsl:stylesheet>
XSL

  doc = Nokogiri::XML(xml_text)
  xslt = Nokogiri::XSLT(xsl)
  out = xslt.transform(doc)

  puts out.to_xml
end

# then..

ruby-1.9.2-p0 > xp "<a><b>c</b></a>"
<?xml version="1.0" encoding="UTF-8"?>
<a>
  <b>c</b>
</a>
=> nil

If you work frequently with JSON and XML, you can also add a couple of shortcuts to your ~/.irbrc file:

require "rubygems"
require "json"
require "nokogiri"

def jp(json_text)
  puts JSON.pretty_generate json_text
end

def xp(xml_text)
  xsl =<<XSL
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="/">
    <xsl:copy-of select="."/>
  </xsl:template>
</xsl:stylesheet>
XSL

  doc = Nokogiri::XML(xml_text)
  xslt = Nokogiri::XSLT(xsl)
  out = xslt.transform(doc)

  puts out.to_xml
end

Then, to use them in any console:

ruby-1.9.2-p0 > jp({ :a => { :b => "c" } })
{
  "a": {
    "b": "c"
  }
}
=> nil

ruby-1.9.2-p0 > xp "<a><b>c</b></a>"
<?xml version="1.0" encoding="UTF-8"?>
<a>
  <b>c</b>
</a>
=> nil

Pretty print of records and collections of records

Apart from awesome_print, which we saw in the previous post, there’s another gem named hirb that can be even more effective in some cases, especially with ActiveRecord objects and collections: it displays them much like tables and query results are formatted in a MySQL console. The easiest way to use it is as follows:

ruby-1.9.2-p0 > require "hirb"
=> nil

ruby-1.9.2-p0 > Hirb.enable
=> true

ruby-1.9.2-p0 > PageStatus.all
+----+-----------+
| id | name      |
+----+-----------+
| 1  | Draft     |
| 2  | Published |
| 3  | Scheduled |
| 4  | Deleted   |
+----+-----------+
4 rows in set

Hirb allows more than this, so be sure to have a look at the documentation.

Prettier, user friendly printing of Ruby objects

If you, like me, are not too fond of how Ruby objects usually look when displayed in a console (everything on a single line, no indentation, and so on), it’s likely you have also looked for ways to prettify this kind of output. This can be useful in a console, but also for producing more user friendly, more readable debugging information in your application’s log files. And not only that: as we’ll see, knowing which “printer” method to use in each case can help you display or log more complete information about your Ruby objects.

Built-in “printers”

The two most common ways of “printing” objects in a console or, more generally, to stdout, are the Kernel#puts and Kernel#print methods. With an object given as argument, print simply outputs the text returned by the object’s to_s method. puts works in a very similar way, but it also adds a new line character at the end of the string and -for some objects- it further manipulates the output; for example, if the argument is an array, puts will display the array’s items one item per line rather than the whole array on a single line as with print:

ruby-1.9.2-p0 > print [1,2,3]
[1, 2, 3] => nil

ruby-1.9.2-p0 > puts [1,2,3]
1
2
3
=> nil

The output of both these methods is readable enough for very simple objects such as strings, numbers, and so on, but it becomes less useful with other objects, in that these methods only display the class name and -if the argument is an instance of a class- the object_id:

ruby-1.9.2-p0 > print Category.first
#<Category:0x00000100d34c20> => nil

ruby-1.9.2-p0 > puts Category.first
#<Category:0x00000100d19010>
=> nil

This kind of output is pretty useless when you are trying to debug something in the console, or when your application prints debugging information to its logs, as it doesn’t include an instance’s attributes or their values – which is exactly what you are usually looking for in logs or debugging output. A slightly better way of printing objects to stdout is the inspect method, which is what IRB uses automatically to echo the result of an expression – for example, an assignment to a variable:

ruby-1.9.2-p0 > category = Category.first
=> #<Category id: 1, name: "Blogging", slug: "blogging", posts_count: 0, seo: nil, type: "Category", created_at: "2011-01-16 17:47:48", updated_at: "2011-01-16 17:47:48">

By default, unless other “printer” methods such as puts or print are used, a class or an instance of a class is displayed with whatever the inspect method returns. This method is usually available both as a class method and as an instance method; unless overridden, it simply shows the name of the class and the attributes, plus -only if it’s an instance- the object_id and the attributes’ values.

Both classes and instances are quite readable when printed in a console, as long as they only have a small number of attributes:

ruby-1.9.2-p0 > Category
Category(id: integer, name: string, slug: string, posts_count: integer, seo: string, type: string, created_at: datetime, updated_at: datetime)

ruby-1.9.2-p0 > Category.first
=> #<Category id: 1, name: "Blogging", slug: "blogging", posts_count: 0, seo: nil, type: "Category", created_at: "2011-01-16 17:47:48", updated_at: "2011-01-16 17:47:48">

You get the same result if you explicitly call the inspect method on an object, or use the Kernel#p method, which is simply a shortcut for “puts obj.inspect“. It works in a similar way to the more common Kernel#puts method, except that p displays the result of the object’s inspect method rather than the result of its to_s method:

ruby-1.9.2-p0 > Category.inspect
=> "Category(id: integer, name: string, slug: string, posts_count: integer, seo: string, type: string, created_at: datetime, updated_at: datetime)"

ruby-1.9.2-p0 > p Category
Category(id: integer, name: string, slug: string, posts_count: integer, seo: string, type: string, created_at: datetime, updated_at: datetime)
=> Category(id: integer, name: string, slug: string, posts_count: integer, seo: string, type: string, created_at: datetime, updated_at: datetime)

The output isn’t great – most people find vertical lists more readable – but it’s OK. However, it quickly becomes harder to read and to follow when you print objects with many properties/attributes, or, for example, collections:

ruby-1.9.2-p0 > Category.all
[#<Category id: 1, name: "Blogging", slug: "blogging", posts_count: 0, seo: nil, type: "Category", created_at: "2011-01-16 17:47:48", updated_at: "2011-01-16 17:47:48">, #<Category id: 2, name: "Database", slug: "database", posts_count: 0, seo: nil, type: "Category", created_at: "2011-01-16 17:47:48", updated_at: "2011-01-16 17:47:48">, ... ]

Another alternative to p, puts and inspect is the to_yaml instance method, which is already available on objects such as ActiveRecord models, among others. All to_yaml does is return the instance serialised in YAML format:

ruby-1.9.2-p0 > Category.first.to_yaml
=> "--- !ruby/object:Category \nattributes: \n id: \"1\"\n name: Blogging\n slug: blogging\n posts_count: \"0\"\n seo: \n type: Category\n created_at: 2011-01-16 17:47:48\n updated_at: 2011-01-16 17:47:48\nattributes_cache: {}\n\nchanged_attributes: {}\n\ndestroyed: false\nmarked_for_destruction: false\npersisted: true\npreviously_changed: {}\n\nreadonly: false\n"

The output is not too different from the previous one, and is actually a little less readable because of the new lines and other characters that are escaped when the YAML output is printed as a plain string.
A first step towards prettifying this kind of output, at least for instances -as said, to_yaml is an instance method, so it cannot be used to prettify classes- is to use it together with puts:

ruby-1.9.2-p0 > puts Category.first.to_yaml
--- !ruby/object:Category
attributes:
id: "1"
name: Blogging
slug: blogging
posts_count: "0"
seo:
type: Category
created_at: 2011-01-16 17:47:48
updated_at: 2011-01-16 17:47:48
attributes_cache: {}

changed_attributes: {}

destroyed: false
marked_for_destruction: false
persisted: true
previously_changed: {}

readonly: false
=> nil

I think it’s already a lot more readable, and you can even save some typing by using the shorthand y method:

# same as "puts Category.first.to_yaml"
ruby-1.9.2-p0 > y Category.first
--- !ruby/object:Category
attributes:
id: "1"
name: Blogging
slug: blogging
posts_count: "0"
seo:
type: Category
created_at: 2011-01-16 17:47:48
updated_at: 2011-01-16 17:47:48
attributes_cache: {}

changed_attributes: {}

destroyed: false
marked_for_destruction: false
persisted: true
previously_changed: {}

readonly: false
=> nil

This output certainly looks more readable than the previous ones, and collections look a lot more readable too. However, to_yaml also shows information you may not be interested in (unless you only show the attributes with “y Category.first.attributes“, but that’s.. uhm.. too much typing); plus, it is an instance method (so you can’t use it to prettify the display of a class) and it isn’t always available out of the box for any kind of object, as it is for ActiveRecord models.

Other limits of built-in “printers”

Instance variables and attributes

Apart from the ugly output, another issue with these “printers” is that they don’t always display the complete structure of an object. Let’s take the following class as example:

class TestClass
  def initialize(an_argument, another_argument)
    @an_argument = an_argument
    @another_argument = another_argument
    @an_instance_variable = "#{an_argument}-#{another_argument}"
    self
  end
end

It’s just a very simple class whose initialize method, given two arguments, stores them in two instance variables, stores the two values concatenated in a third instance variable, and finally returns a reference to the instance itself. Let’s see what happens with the various methods we’ve seen so far when we print an instance of this class in a console:

# puts
puts TestClass.new("A", "B")
#<TestClass:0x000001049034e0>
=> nil

# print
print TestClass.new("A", "B")
#<TestClass:0x000001048fbd30> => nil

# implicitly using "inspect";
# same as ...
# TestClass.new("A", "B").inspect
# or ...
# p TestClass.new("A", "B")
TestClass.new("A", "B")
=> #<TestClass:0x000001048f6470 @an_instance_variable="A-B", @another_argument="B", @an_argument="A">

# y; same as TestClass.new("A", "B").to_yaml
y TestClass.new("A", "B")
--- !ruby/object:TestClass
an_argument: A
an_instance_variable: A-B
another_argument: B
=> nil

# pretty print
require "pp"
pp TestClass.new("A", "B")
#<TestClass:0x000001048d57e8
@an_argument="A",
@an_instance_variable="A-B",
@another_argument="B">
=> #<TestClass:0x000001048d57e8 @an_instance_variable="A-B", @another_argument="B", @an_argument="A">

As we can see, and as previously anticipated, both puts and print are pretty useless in this context, because they tell us nothing of the internal state of the object, and in particular they tell us nothing of the instance variables that we know are there. inspect, y and pp display a more useful output instead. pp, which stands for pretty print, is yet another alternative that displays instance variables and attributes in a vertical list as with y, and does not use serialisation, but it requires an additional module for little gain, as you can see.

More complex objects

Another, perhaps more important, issue with all these “printers”, is that for instances of ActiveRecord models, the values of attributes are displayed as they are stored in the database.

This is quite important, because it means that more complex models are not always displayed with their complete structure. An example may be an ActiveRecord model that has a text column storing another object in YAML serialised format (using the serialize class method).
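As a reminder of what serialize does in general – a hypothetical Product model with a preferences text column, not the tableless_model internals – a declaration like this stores whatever object you assign to the column as YAML text:

class Product < ActiveRecord::Base
  # preferences is a text column; serialize writes the assigned object
  # to it as YAML on save and deserialises it when the record is loaded
  serialize :preferences
end

product = Product.first
product.preferences = { "colour" => "red", "size" => "XL" }
product.save # the hash is stored in the preferences column as YAML text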

For instance, say we have the model Category that stores a serialised instance of the model SeoOptions in the column seo, using the tableless_model gem.

If you print an instance of such a model, you will see the presence of the seo attribute and its value as it is stored in the database. In this example that value is still nil, but the reader/getter method for the attribute actually returns an instance of SeoOptions, with its own attributes already set to their default values:

ruby-1.9.2-p0 > Category.first
=> #<Category id: 1, name: "Blogging", slug: "blogging", posts_count: 0, seo: nil, type: "Category", created_at: "2011-01-16 17:47:48", updated_at: "2011-01-16 17:47:48">

ruby-1.9.2-p0 > Category.first.seo
=> <#SeoOptions meta_description="" meta_keywords="" noarchive=false nofollow=false noindex=false title_tag="">

In these cases, all the methods we’ve seen are useless, in that none of them tell us what seo represents, but they only tell us that the attribute has a nil value in the database. Not very useful, if you are debugging something in a console, or if you are printing this kind of information to a log file for later inspection.

Better alternatives

Hijacking the inspect method

Although the output of these default “printers” is often ugly and not complete enough, it is possible to improve things by hijacking or overriding the inspect method we saw earlier. This can actually be quite useful if you prefer to log some debugging information in a custom format. Let’s see an example of a custom inspect for the test class used in one of the previous examples:

class TestClass
  def initialize(an_argument, another_argument)
    @an_argument = an_argument
    @another_argument = another_argument
    @an_instance_variable = "#{an_argument}-#{another_argument}"
    self
  end

  def inspect
    instance_variables.inject([
      "\n#TestClass",
      "\tObject_i = #{object_id}",
      "\tInstance variables:"
    ]) do |result, item|

      result << "\t\t#{item} = #{instance_variable_get(item)}"
      result
    end.join("\n")
  end
end

In the example above, we want inspect to display the instance variables in a nicer format, as a vertical list, without YAML and without having to require a separate module as for pp. The result, as expected, is:

ruby-1.9.2-p0 > instance = TestClass.new("A", "B")
=>
#TestClass
    Object_i = 2160235560
    Instance variables:
        @an_instance_variable = A-B
        @another_argument = B
        @an_argument = A

Of course this is a very simple example, just to give you an idea, but it should be clear that by overriding the behaviour of the inspect method, it is possible to “print” Ruby objects in a prettier and more useful way.

Ready meal: the awesome_print gem

If you don’t like the output of the default “printers”, you don’t really have to customise the result of inspect yourself. You can save some time by using one of the ready-made gems, libraries and plugins that do this out of the box and are often quite customisable too.

One such component, quite popular judging by the download stats on RubyGems, is the awesome_print gem.

Using it, once you’ve installed the gem, is as easy as

ruby-1.9.2-p0 > require "ap"
=> true

ruby-1.9.2-p0 > x = { :a => 1, :b => "a string", :c => [1,2,[3,4]], :d => Time.now }
=> {:a=>1, :b=>"a string", :c=>[1, 2, [3, 4]], :d=>2011-01-17 19:18:03 +0000}

ruby-1.9.2-p0 > ap x
{
    :a => 1,
    :b => "a string",
    :c => [
        [0] 1,
        [1] 2,
        [2] [
            [0] 3,
            [1] 4
        ]
    ],
    :d => 2011-01-17 19:18:03 +0000
}
=> nil

As you can see, the output is a lot cleaner and more readable in this example. Unfortunately, it doesn’t prettify instance variables in the same way… Remember the TestClass from the previous examples?

ruby-1.9.2-p0 > require "ap"
=> true

ruby-1.9.2-p0 > ap TestClass.new("A", "B")
#<TestClass:0x00000102385e98 @an_instance_variable="A-B", @another_argument="B", @an_argument="A">
=> nil

It’s not a big issue: it prettifies attributes, and that’s surely very useful; it would just be nice to see instance variables treated the same way, as this can help with debugging at times. One big win with awesome_print, though, is that it works beautifully with complex ActiveRecord models. See the difference when we print with ap an instance of the Category class seen in the previous examples, which stores an instance of SeoOptions serialised as YAML in the seo column:

ruby-1.9.2-p0 > require "ap"
=> true

ruby-1.9.2-p0 > ap Category.first
#<Category:0x00000101fab480> {
    :id => 1,
    :name => "Blogging",
    :slug => "blogging",
    :posts_count => 0,
    :seo => {
        "title_tag" => "",
        "meta_description" => "",
        "meta_keywords" => "",
        "noindex" => false,
        "nofollow" => false,
        "noarchive" => false
    },
    :type => "Category",
    :created_at => Sun, 16 Jan 2011 17:47:48 UTC +00:00,
    :updated_at => Sun, 16 Jan 2011 17:47:48 UTC +00:00
}
=> nil

Hurrah! As you can see, ap doesn’t just print the values of the attributes as they are stored in the database, as the other methods do; instead, it prints the output of the reader/getter method for each attribute. So for the seo attribute it correctly displays the nested object rather than the nil value stored in the database.

I find using awesome_print very useful in my applications; I use it a lot in the console, and also in some debugging logs when I need to capture more information than usual and I need this information to be more complete, more readable and user friendly. It’s also useful to integrate awesome_print with IRB: just create a file named .irbrc in the root of your home folder, and paste this snippet in it:

require 'ap'
module IRB
  class Irb
    def output_value
      ap @context.last_value
    end
  end
end

Once you’ve done this, ap basically replaces inspect:

ruby-1.9.2-p0 > { :a => 1, :b => "a string", :c => [1,2,[3,4]], :d => Time.now }
{
    :a => 1,
    :b => "a string",
    :c => [
        [0] 1,
        [1] 2,
        [2] [
            [0] 3,
            [1] 4
        ]
    ],
    :d => 2011-01-17 20:01:48 +0000
}

Since this post is also about how to make the printing of Ruby objects prettier, I should mention too that awesome_print also colorises the output very nicely! The code snippets above use color syntax highlighting already, so you will have to give awesome_print a try to see the difference with the boring mono color output you get with the other methods.

Last, but not least, a word of caution: you may be tempted to use awesome_print to prettify your production logs and make them more complete and useful. I also do, in some “controlled circumstances”, but …beware: because of the way awesome_print works, it can literally kill the performance of your web applications if misused or abused in production environments!

I will explain more in detail, with some metrics, in the next post. In the meantime, you’re warned 😀

Know of other/better ways to prettify Ruby objects in console and logs? Please do let me know in the comments!

A workaround for ActiveRecord’s partial updates not working properly with serialised columns (bug?)

Today, while working on a side project that makes heavy use of the tableless_model gem, I came across what looks like a bug in ActiveRecord’s support for partial updates when used with models that have serialised columns.

Actually, this isn’t the first time I’ve seen this kind of issue, but I had completely forgotten about it, so previous releases of the gem were also, if indirectly, affected. I have now pushed a new release (0.0.7) with a fix/workaround, so if you have downloaded and are using the gem, please update it. The workaround I applied to the gem’s code, which I will describe together with the actual problem in this post, is a pretty simple hack that may also work in other cases when using ActiveRecord models with serialised columns.

Some context

Since Rails 2.1, ActiveRecord has supported, without additional plugins, a feature called partial updates. If it is enabled (the default in Rails) and you change some -but not all- attributes of a model’s instance, then upon saving that instance ActiveRecord generates a SQL update statement that includes only the fields that have actually changed and therefore need updating, rather than the full update statement, including all fields, that it would generate with partial updates switched off.

This feature, at least in theory, should help improve performance when updating models, by reducing the size of the SQL statements sent over to the database and then parsed. Partial updates may actually help in some cases, for example with text or -worse- blob columns (oh, my…) that may contain large amounts of data; in these cases there is a benefit in not sending large chunks of SQL text over the network when those text/blob columns haven’t changed and therefore don’t need to be updated. In most other cases, though, partial updates have little to no effect.

Partial updates work thanks to some other feature usually referred to as dirty attribute tracking. This feature basically keeps track of the state of each attribute of a model from when the model is initialised, to when the model is updated. It works, behind the scenes, by using an instance variable named @changed_attributes that is initially an empty hash when an instance of a model is initialised. Whenever (or almost, as we’ll see) the value of an attribute in the model is changed, its writer/setter method calls the write_attribute method with the new value for the attribute as argument. The write_attribute method then also updates the hash stored in @changed_attributes, adding to it a key/value pair having the name of the changed field as key, and an array in the form [ attribute’s previous value, attribute’s new value ], as value for that key. Borrowing once again the example I showed in the previous post for the tableless_model gem, we’d have something like this:

page = Page.first
=> #<Page id: 1, status: "Published", title: "Original title", content: "...", seo: <#SeoOptions meta_description="" meta_keywords="" noarchive=false nofollow=false noindex=false title_tag="">>

page.changes
=> {}

page.title = "New title"
=> "New title"

page.changes
=> { "title" => [ "Original title", "New title" ] }

As you can see in this example, the hash with the changes is initially empty. Once we change the value of one attribute, then, that hash will only contain that single change with both the previous value and the new value for the attribute title.

When the model’s save method is called, eventually, ActiveRecord converts the changes in @changed_attributes into a SQL update statement similar to this, for the example above:

UPDATE `pages` SET `title` = 'New title', `updated_at` = '2011-01-12 21:05:01'
WHERE (`pages`.`id` = 1)

As you can see, only the field that changed is included in this update statement, apart from the updated_at field which is always updated automatically. As said, this could be useful in some cases: in the example, if the model Page had a large text column and partial updates were switched off, the SQL statement above would also include this text column, however huge its value could be, and all the others, regardless of the fact that only the title had changed. Once the SQL update statement is executed in the database, ActiveRecord will revert @changed_attributes once again to an empty hash.
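Continuing the example above, once the record is saved the tracked changes are reset to an empty hash (a minimal console-style sketch):

page.save
=> true

page.changes
=> {}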

A few notes:

1) you can switch partial updates off, if needed with:

ActiveRecord::Base.partial_updates = false

2) Apart from reduced network traffic, smaller SQL statements also mean less “noise” in your logs, depending on your logging settings.

3) Simultaneous updates to the same record with partial updates switched on can easily fail unless you enable optimistic locking; for this to work, your model must also have an integer column named lock_version, which ActiveRecord uses to keep a version number for each record.
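A sketch of a migration that adds such a column (the table name here is just an example; the column name and the 0 default follow the ActiveRecord convention):

class AddLockVersionToPages < ActiveRecord::Migration
  def self.up
    # lock_version starts at 0 and is incremented by ActiveRecord on every update
    add_column :pages, :lock_version, :integer, :default => 0
  end

  def self.down
    remove_column :pages, :lock_version
  end
end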

4) Apart from the changes method, thanks to the dirty attribute tracking and partial updates, ActiveRecord models also expose other helper methods that can be used to determine the state of each attribute, whether or not a record has changed, and which attributes have changed (only names), if any:

page = Page.first
=> #<Page id: 1, status: "Published", title: "Original title", content: "...", seo: <#SeoOptions meta_description="" meta_keywords="" noarchive=false nofollow=false noindex=false title_tag="">>

page.changed?
=> false

page.title = "test"
=> "test"

page.changed?
=> true

page.title_changed?
=> true

page.content_changed?
=> false

# attributes that have changed - only names
page.changed
=> ["title"]

# getting the change for a specific attribute
page.title_change
=> ["Original title", "title"]

Another thing worth noting about these helper methods is that if an attribute is changed in a way that bypasses its normal setter, that attribute -somewhat surprisingly- won’t be listed among the changes that need to be saved. An example is in-place string methods such as upcase!:

page = Page.first
=> #<Page id: 1, status: "Published", title: "Original title", content: "...", seo: <#SeoOptions meta_description="" meta_keywords="" noarchive=false nofollow=false noindex=false title_tag="">>

page.title
=> "Original title"

page.title.upcase!
=> "ORIGINAL TITLE"

page.changes
=> {}

As you can see, the tracking of changes didn’t work as you might have expected in this case. I haven’t investigated this behaviour in depth – it is one of several things about partial updates that I find quite odd – so if there’s a good reason for it, I don’t know it yet. Anyway, it is possible to force the detection of this kind of change by manually calling the attribute_name_will_change! method… before the change happens:

page = Page.first
=> #<Page id: 1, status: "Published", title: "Original title", content: "...", seo: <#SeoOptions meta_description="" meta_keywords="" noarchive=false nofollow=false noindex=false title_tag="">>

page.title
=> "Original title"

page.title_will_change!
=> "Original title"

page.title.upcase!
=> "ORIGINAL TITLE"

page.changes
=> { "title" => [ "Original title", "ORIGINAL TITLE" ] }

As said, this is one of the weird things I have found so far concerning dirty attribute tracking and partial updates, and it leads us to the original problem with the gem.

The problem

Now that the context on partial updates is, hopefully, clear, it's time to go back to the original problem I had to fix today with the tableless_model gem.

If you haven't yet read the previous post in which I described the functionality of this gem, please do, as it may clear up a few more things. You may also want to look at the source code for some more details.

In that post, we saw an example of how to use this tiny gem to store a serialised, tableless model (SeoOptions) in a text column of the parent model, Page. The result was that, thanks to serialisation, we were able to remove the one-to-one association between these two models and drop the seo_options table altogether, reducing database complexity and improving performance by eliminating the additional query (whether via :include or lazy loading) needed to fetch the SEO options for a given page.

We also saw that to make this work, the gem uses the serialize class method available in ActiveRecord::Base. The problem I noticed once again today is that if serialize is used with an instance of another class, and an attribute of that instance is changed, page.changes won't list the change, similarly to what we saw earlier with methods such as upcase!:

page = Page.first
=> #<Page id: 1, status: "Published", title: "Original title", content: "...", seo: <#SeoOptions meta_description="" meta_keywords="" noarchive=false nofollow=false noindex=false title_tag="">>

page.seo
=> <#SeoOptions meta_description="" meta_keywords="" noarchive=false nofollow=false noindex=false title_tag="">

page.seo.title_tag = "111"
=> "111"

page.seo.title_tag
=> ""

page.changes
=> {}

As you can see, the page doesn't know about the change, and the change that at first seemed to have been applied is gone once we access page.seo.title_tag again.

Unfortunately, because in this case we are modifying an instance of another class rather than an attribute of the model itself, will_change! won't work as it does for upcase!:

page = Page.first
=> #<Page id: 1, status: "Published", title: "Original title", content: "...", seo: <#SeoOptions meta_description="" meta_keywords="" noarchive=false nofollow=false noindex=false title_tag="">>

page.seo
=> <#SeoOptions meta_description="" meta_keywords="" noarchive=false nofollow=false noindex=false title_tag="">

page.seo_will_change!

page.seo.title_tag = "111"
=> "111"

page.seo.title_tag
=> ""

page.changes
=> {"seo"=>[{"title_tag"=>"", "meta_description"=>"", "meta_keywords"=>"", "noindex"=>false, "nofollow"=>false, "noarchive"=>false}, {"title_tag"=>"", "meta_description"=>"", "meta_keywords"=>"", "noindex"=>false, "nofollow"=>false, "noarchive"=>false}]}

Here you can see that will_change! forces seo into the list of changes; however, the actual change of title_tag to “111” was simply lost.

The fix/workaround

To fix this issue, which basically means that any changes made directly to the serialised object are simply lost when you save the parent model, I have made a few changes to tableless_model (please refer to the source code for more details).

Firstly, in the TablelessModel class I defined two writable accessors:

class TablelessModel < Hash
  ...
  attr_accessor :__owner_object, :__serialized_attribute
  ...
end

Then, in the has_tableless class method injected into ActiveRecord::Base, I set these two accessors to, respectively, the instance of the parent model and the name of the column that will store the tableless model in YAML-serialised format:

...
define_method column_name.to_s do
  instance = class_type.new(read_attribute(column_name.to_sym) || {})

  instance.__owner_object = self
  instance.__serialized_attribute = column_name

  instance
end
...

Lastly, in the writer method for the tableless model’s attributes (remember that TablelessModel is a specialised hash behind the scenes), there’s a call to the parent/owner object. Each time an attribute of the serialised tableless model is changed, the setter for the serialised column in the parent object is automatically called to update its content with the updated tableless model:

...
def []=(attribute_name, value)
  raise NoMethodError, "The attribute #{attribute_name} is undefined" unless self.class.attributes.has_key? attribute_name.to_s

  return_value = super(attribute_name.to_s, self.class.cast(attribute_name, value))

  if self.__owner_object
    # This makes the tableless model compatible with partial_updates:
    # whenever a property in the tableless model is changed, we force the parent/owner
    # object into a changed state by updating it with a new, updated instance of the
    # tableless model
    self.__owner_object.send "#{self.__serialized_attribute.to_s}=".to_sym, self
  end

  return_value
end
...

The effect is similar to what would happen if we manually updated the serialised attribute in the parent model, using the options= setter (as per the previous example) and setting it to the new, updated version of the tableless model:

page = Page.first
=> #<Page id: 1, status: "Published", title: "Original title", content: "...", seo: <#SeoOptions meta_description="" meta_keywords="" noarchive=false nofollow=false noindex=false title_tag="">>

page.seo
=> <#SeoOptions meta_description="" meta_keywords="" noarchive=false nofollow=false noindex=false title_tag="">

page.seo.title_tag = "CHANGED!"
=> "CHANGED!"

page.seo.title_tag
=> "CHANGED!"

page.changes
=> {"seo"=>[{"title_tag"=>"", "meta_description"=>"", "meta_keywords"=>"", "noindex"=>false, "nofollow"=>false, "noarchive"=>false}, {"title_tag"=>"CHANGED!", "meta_description"=>"", "meta_keywords"=>"", "noindex"=>false, "nofollow"=>false, "noarchive"=>false}]}

Bingo! As you can see, the change in the serialised model is no longer lost, and it will be persisted correctly to the database when saving.

As I've said, I can see the benefits of partial updates in some particular cases, but this feature often causes more issues than the good it brings. I know of at least one other issue caused by it, with models that have composite keys; however, I will need to check whether the problem still exists, and if it does, it will be the subject of another post.

In the meantime, this workaround seems to be working as expected. So, beware of these issues with partial updates if you use serialised columns in your applications!

Painless, ultra fast hot backups and restores of MySQL databases with Percona’s XtraBackup

A better alternative to mysqldump?

In the previous post, we’ve seen how to perform backups and restores of MySQL databases using the mysqldump tool bundled with the MySQL server distribution, together with a few tricks that help speed up both backups and restores, among other things.

mysqldump is very likely the most popular tool for backing up MySQL databases, and in most cases it just works, and pretty well too. It’s easy to use, free, comes with MySQL and produces simple text files containing standard DROP/CREATE/INSERT SQL statements that, once replayed during a restore, will simply recreate the databases, the tables and all the data on the destination MySQL server. By producing simple text files based on SQL statements, mysqldump may also be the perfect tool when migrating data between different versions of MySQL, or when you need to change something in the text file dump before restoring it.

However, mysqldump may not be up to the job if you need to back up MySQL databases that must stay online, without any downtime, while they are in use for writes as well as reads; for example, when the databases are also replicated and you need to copy them to another server instance to set up a new slave or resync an existing one. In these circumstances, if you resort to using mysqldump anyway, you may have to choose between locking all the tables while backing up (thus blocking writes in the meantime, which in some cases may mean breaking applications) and having dumps that may contain inconsistent data. But you are unlikely to want your users to experience downtime, and in any case you don't want your databases to change while you are exporting the data, do you?

Another issue is performance. While mysqldump works great with small databases, it is not efficient with large ones, and restoring a SQL dump of a large database can take a very, very long time. With the tricks we saw in the previous post it is possible to speed up both dumps and restores significantly, but this still won't work well for very large databases. So, unless you are very patient, don't mind waiting and can afford to sit for a week watching your MySQL server while it restores your large db, sooner or later you will have to look for other, more efficient options.

LVM snapshots

Among the other options there are both commercial and free tools. One popular (free) choice is to use LVM snapshots; they work well when you can use them (with virtual private servers, for example, you often can't, unless you are given the possibility to customise the disk layout), but they have their weaknesses too; we'll look at this option in more detail in a future post. Suffice it to say here that LVM snapshots can put a significant load on the disks, and therefore on the performance of the whole server. Plus, LVM snapshots only capture what MySQL has already flushed to disk at that moment, not the data it may still have in memory, so restoring from an LVM snapshot may require a crash recovery.
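
For reference, a backup based on LVM snapshots looks roughly like the following sketch; the volume group, logical volume and mount point names are hypothetical and will depend on your setup:

# hypothetical names: volume group vg0, logical volume mysql-data holding the MySQL datadir;
# in practice you would hold FLUSH TABLES WITH READ LOCK in an open mysql session while the snapshot is created
lvcreate --snapshot --size 1G --name mysql-snap /dev/vg0/mysql-data
mount /dev/vg0/mysql-snap /mnt/mysql-snap
rsync -a /mnt/mysql-snap/ /backups/mysql-snapshot/
umount /mnt/mysql-snap && lvremove -f /dev/vg0/mysql-snap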

Hot backups!

A better alternative yet is MySQL Enterprise Backup (formerly InnoDB Hot Backup), which can back up a live MySQL server without downtime and at the same time ensure that our backups are consistent even if the databases are in use for writes during the process. Plus, restores are incredibly fast. Sounds great, yeah? Unfortunately, MySQL Enterprise Backup is a pricey commercial product that may be beyond reach in many cases (5K USD per server), especially for companies with tight budgets that may have chosen MySQL for its cost (none) in the first place, as well as for its good performance.

But fear not if you are on the cheap like me: luckily for us, there is also a free, open source solution by the good people at Percona (a MySQL consultancy company) which is very, very similar to Enterprise Backup/InnoDB Hot Backup. It's called XtraBackup, and while it's still pretty young, having first been developed less than a couple of years ago, it's a pretty solid solution that works very well despite some limitations. It also compares well to the commercial alternative: it only takes a bit longer to perform backups of the same data and requires a little more disk space. I haven't had a chance to test this myself yet, so I'll trust what I've read on pythian.com: Taste test: Innobackup vs. Xtrabackup.

Here are the main features available with the current release, from the home page:

  • Create hot InnoDB backups without pausing your database
  • Make incremental backups of MySQL
  • Stream compressed MySQL backups to another server
  • Move tables between MySQL servers online
  • Create new MySQL replication slaves easily
  • Backup MySQL without adding load to the server

While the product itself is free, Percona also provides commercial support, if needed. So, let’s see now how to use Xtrabackup for backing up and restoring MySQL databases.

Installing

For starters, you'll need to install XtraBackup. In this example, we'll install a Debian package on a Ubuntu server. Head to XtraBackup's download page first, or download the latest version available (1.4 at the time of this writing) with wget or similar, making sure you choose the right package for your OS.

wget https://www.percona.com/downloads/XtraBackup/XtraBackup-1.4/deb/lucid/x86_64/xtrabackup_1.4-193.lucid.25_amd64.deb
dpkg -i xtrabackup_1.4-193.lucid.25_amd64.deb

Backing up

The package installs a binary named xtrabackup that takes care of the actual backups and restores, but it also installs a Perl script named innobackupex-1.5.1 that makes XtraBackup a bit easier to use and also backs up MyISAM tables. Backing up is pretty simple:

innobackupex-1.5.1 --user=... --password=... --slave-info \
--defaults-file=/etc/mysql/my.cnf --databases="db1 db2" --stream=tar ./ \
| gzip -c -1 > /backups/backup.`date +%Y-%m-%d-%H-%M-%S`.tar.gz

You should see an output similar to this:

This software is published under
the GNU GENERAL PUBLIC LICENSE Version 2, June 1991.
IMPORTANT: Please check that the backup run completes successfully.
At the end of a successful backup run innobackupex-1.5.1
prints "completed OK!".
innobackupex-1.5.1: Using mysql Ver 14.14 Distrib 5.1.41, for debian-linux-gnu (i486)
using readline 6.1
innobackupex-1.5.1: Using mysql server version Copyright 2000-2008 MySQL, 2008 Sun
Microsystems, Inc.
innobackupex-1.5.1: Created backup directory /backups
101214 21:47:59 innobackupex-1.5.1: Starting mysql with options:
--defaults-file="/etc/mysql/my.cnf" --password=... --user=... --unbuffered --
101214 21:47:59 innobackupex-1.5.1: Connected to database with mysql child process
(pid=17917)
101214 21:48:03 innobackupex-1.5.1: Connection to database server closed
101214 21:48:03 innobackupex-1.5.1: Starting ibbackup with command: xtrabackup
--defaults-file="/etc/mysql/my.cnf" --backup --suspend-at-end --log-stream
--target-dir=./
innobackupex-1.5.1: Waiting for ibbackup (pid=17944) to suspend
innobackupex-1.5.1: Suspend file '/var/lib/mysql/xtrabackup_suspended'
xtrabackup: suspend-at-end is enabled.
xtrabackup: uses posix_fadvise().
xtrabackup: cd to /var/lib/mysql
xtrabackup: Target instance is assumed as followings.
xtrabackup: innodb_data_home_dir = ./
xtrabackup: innodb_data_file_path = ibdata1:10M:autoextend
xtrabackup: innodb_log_group_home_dir = ./
xtrabackup: innodb_log_files_in_group = 2
xtrabackup: innodb_log_file_size = 5242880
xtrabackup: Stream mode.
>> log scanned up to (92295311910)


xtrabackup: Transaction log of lsn (92294278211) to (92298258089) was copied.
101214 22:47:23 innobackupex-1.5.1: All tables unlocked
101214 22:47:23 innobackupex-1.5.1: Connection to database server closed
innobackupex-1.5.1: Backup created in directory '/backups'
innobackupex-1.5.1: MySQL binlog position: filename 'mysql-bin.000730',
position 16838362
innobackupex-1.5.1: MySQL slave binlog position: master host '192.168.162.110',
filename 'mysql-bin.000651', position 99728693
innobackupex-1.5.1: You must use -i (--ignore-zeros) option for extraction
of the tar stream.
101214 22:47:23 innobackupex-1.5.1: completed OK!

In this first example, we have instructed XtraBackup to stream the backup data in tar format; the output is then compressed with gzip into a file having the current timestamp in its name. You may have noticed that in the example above I am using gzip's lowest compression level: higher levels only require more CPU and resources in general, with little improvement, especially if you pipe XtraBackup's output into the compressor while the backup is still in progress. Actually, I have often had problems compressing the backup data while the backup is still running; in many cases I found that backing up first, and then compressing the backup once it is done, works better.

However, in my experience with the tool, whether compressing while backing up succeeds also depends on the value of the wait_timeout system variable in MySQL. On my servers, whenever I have set this variable to a low value I've always had problems getting backup and compression to work in one step. There may be better settings or workarounds, but I've found that setting wait_timeout to a value higher than the average time taken by the backup process usually makes for a successful compressed backup without requiring a separate compression step. Keep in mind, though, that a high value for wait_timeout can cause a whole lot of other issues depending on the usage pattern of your MySQL instances.
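
When compressing on the fly proves unreliable, the two-step approach would look roughly like this (a sketch: innobackupex writes the backup into a timestamped subdirectory under the target directory, and the <timestamped-directory> below is a placeholder for whatever name it creates):

# step 1: back up to a local directory, without streaming
innobackupex-1.5.1 --user=... --password=... --slave-info \
--defaults-file=/etc/mysql/my.cnf --databases="db1 db2" /backups/

# step 2: once the backup has completed OK, compress it in a separate step
tar czf /backups/backup.`date +%Y-%m-%d-%H-%M-%S`.tar.gz -C /backups <timestamped-directory>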

For this reason, you may want to keep wait_timeout at a not-too-high value, and only change it for the XtraBackup session. You can easily do this by patching the innobackupex-1.5.1 script as suggested by reader “Gu Lei” in a bug somebody filed for exactly the same issue. So edit the file /usr/bin/innobackupex-1.5.1 and change it as follows:

if (compare_versions($mysql_server_version, '4.0.22') == 0
    || compare_versions($mysql_server_version, '4.1.7') == 0) {
  # MySQL server version is 4.0.22 or 4.1.7
  mysql_send "COMMIT;";
  mysql_send "set interactive_timeout=28800;"; # ADD THIS
  mysql_send "set wait_timeout=28800;"; # ADD THIS
  mysql_send "FLUSH TABLES WITH READ LOCK;";
} else {
  # MySQL server version is other than 4.0.22 or 4.1.7
  mysql_send "set interactive_timeout=28800;"; # ADD THIS
  mysql_send "set wait_timeout=28800;"; # ADD THIS
  mysql_send "FLUSH TABLES WITH READ LOCK;";
  mysql_send "COMMIT;";
}

Of course, the timeout value must be high enough to cover the time XtraBackup could possibly take to back up your data.
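
You can check the current value on your server, before deciding what to set, with something like this:

mysql -u... -p... -e "SHOW GLOBAL VARIABLES LIKE 'wait_timeout';"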

The arguments --slave-info and --databases are optional. The former saves the master status info together with the data (assuming the server you are backing up from is a master, or an existing slave in a replicated environment), which is useful if you are in the process of setting up a new slave or resyncing an existing one. The latter lets you specify which databases you want to back up, if you don't want to back up all of them; it accepts a space-separated list of database names. Obviously, if your MySQL configuration file is in another location on your server, remember to change the path in the relevant argument accordingly.

Streaming backups to another MySQL server

The command above works pretty well if you want to make backups manually or if you want to schedule, for example, daily full backups with cron. However, if you are backing up with the purpose of restoring to another server, you may want to skip creating a local file that then needs to be transferred to the destination server for the restore. You can instead stream XtraBackup's output directly to the destination server and create a backup archive there, by using netcat.

Netcat, which we already met in the previous post on backing up with mysqldump, is a simple yet very useful utility that lets you stream data over a TCP connection and then, for example, save that data to a file on the destination host. First, let's get netcat running on the destination host, listening on a port of our choice, for example 6789, and redirect its output to a file on disk:

nc -l 6789 > /backups/master-data.tar

Then, on the server that hosts the data we want to migrate:

innobackupex-1.5.1 --user=... --password=... --slave-info \
--defaults-file=/etc/mysql/my.cnf --databases="db1 db2" --stream=tar ./ \
| nc xxx.xxx.xxx.xxx 6789

where xxx.xxx.xxx.xxx is the IP address of the destination host. XtraBackup will take its time depending on the amount of data to transfer; once the process is completed, netcat will automatically close on the destination host, with all the data copied into the target file. If all went well, XtraBackup will complete the process on the source host saying:

innobackupex-1.5.1: You must use -i (--ignore-zeros) option
for extraction of the tar stream.
101214 22:47:23 innobackupex-1.5.1: completed OK!

If not, well, you are going to have trouble, of course.

Restoring

If backing up with XtraBackup was easy, restoring isn't any more difficult; it just requires a few additional steps rather than a single one. First, on the destination host (we are done on the source host), back up the existing MySQL data folder if needed, and then delete or rename it, so that we can create a new, empty data folder in the same location to contain the restored data.

service mysql stop # you need to shut down MySQL before restoring
mv /var/lib/mysql /var/lib/mysql.bak && mkdir /var/lib/mysql

Next, extract the tar archive created while streaming the data during the backup into the new MySQL data folder. As XtraBackup reminds us at the end of the backup process, we need to use the --ignore-zeros option (or -i) when untarring. If you are restoring from a gzipped tarball you will also need the -z option; here, for example's sake, I am assuming you used netcat instead, but a gzipped variant is shown right after the next command.

cd /var/lib/mysql && tar xfvi /backups/master-data.tar
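
If you are restoring from a gzipped tarball instead (as in the very first backup example), the same step would look something like this:

cd /var/lib/mysql && tar -izxvf /backups/backup.tar.gz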

Unless something is wrong with the archive for some reason, you should now have a new MySQL data folder identical to the one on the source host at the moment the backup was performed. The only thing is that this data folder is not ready for use with MySQL yet: it first needs to be “prepared”, to apply the changes that XtraBackup captured in the transaction logs during the backup. Once this is done, fix the permissions and start MySQL again:

cd /var/lib/mysql && innobackupex-1.5.1 --apply-log --ibbackup=xtrabackup_51 \
./ && chown -R mysql:mysql /var/lib/mysql

service mysql start

If all went well so far, MySQL should start correctly with the restored database(s).

Setting up the new/updated instance as replication slave

Assuming now you want to use this new server as a slave of the instance you backed the data up from, and that you know how to set up replication (we'll look at how to do this in another post), or that this instance already was a slave and you just need to resync it, you will have to update the information that MySQL uses to connect to the master and replicate data. If you used the --slave-info option with XtraBackup during the backup, as in my example, you should also see in the new data folder some additional files containing exactly this information.

Nothing complicated here… but you need to be careful as to which file you should look at. You have two possibilities here:

Scenario 1 – You are cloning an existing slave

That is, you are adding a new slave. Let's call the host you previously backed the data up from Slave-0. Slave-0 itself was already configured as a slave of another MySQL server instance, its master, which we'll call… Master. Right now you basically want to “clone” Slave-0, that is, you want to add a new slave to Master. Let's call this new slave Slave-1. So, to recap: both Slave-0 AND the new Slave-1 will end up replicating from the same Master, while there won't be any link between Slave-0 and Slave-1. Hope this is clear! 🙂

If this is what you are trying to do here, the file you need to look at is xtrabackup_slave_info. If you look at the content of this file,

cat /var/lib/mysql/xtrabackup_slave_info
CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000834', MASTER_LOG_POS=50743116

you’ll see that it already has a “CHANGE MASTER..” command ready for you to use. So open a MySQL console, and just execute that command:

mysql> STOP SLAVE;
# optional, if this instance was already a slave and MySQL is configured
# without the skip-slave-start option
mysql> CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000834',
MASTER_LOG_POS=50743116;
mysql> START SLAVE;
mysql> SHOW SLAVE STATUS \G

If everything was OK, you’ll see something like this:

...
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
...
Seconds_Behind_Master: 1643
...

Seconds_Behind_Master is an estimation (not too accurate, I must say) of the time in seconds it will take for this new slave to catch up with the master. How long it takes depends on how long ago the backup you used to build this slave was taken from the existing slave you cloned, and on how much data has changed on the master in the meantime.

Update Tue, May 17 2011: As reader Mora points out in the comment, this definition of Seconds_Behind_Master is not correct.

Seconds_Behind_Master is how many seconds the slave is behind the master. It is not an estimation of how long it's going to take to catch up: if the slave has been stopped for one hour, it will be 3600 seconds behind, but it can take anywhere from a few seconds to a few days (or more) to catch up.

Thanks to Mora for the clarification.

Note: you may have noticed that the file containing the information about the master only contains the name of, and position in, the master's binary log file in use at the moment the backup was taken; there is no username or password. This is because credentials are not saved there, for obvious reasons. So unless you have added values for MASTER_HOST, MASTER_USER and MASTER_PASSWORD to your my.cnf file, you will need to add these settings to the “CHANGE MASTER..” command above.
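
In that case, the complete command would look something like this (host and credentials are placeholders):

mysql> CHANGE MASTER TO MASTER_HOST='...', MASTER_USER='...',
MASTER_PASSWORD='...', MASTER_LOG_FILE='mysql-bin.000834',
MASTER_LOG_POS=50743116;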

Scenario 2 – You are setting up a new slave, or resyncing an existing slave whose master is the MySQL instance you backed the data up from in the previous steps

In this second scenario, instead, say that:

  • you want to use the MySQL server instance you previously backed your data up from as the master in a typical replication; let's call this instance Master;
  • you want to use the new instance you are currently setting up from that backup as the Slave. Or, the master-slave replication between these two already existed but you need to resync this slave.

If this is what you are trying to do, the file you need to look at is xtrabackup_binlog_info instead. If you look at the content of this file,

cat /var/lib/mysql/xtrabackup_binlog_info
mysql-bin.000228 56875717

you'll see something slightly different from what the other file contains. The first value is the name of the binary log file on the master that you need to attach this slave to; the second value is the position within that binary log from which MySQL has to start catching up with changes from the master. Optionally, you may also see a third value: the list of databases excluded from the replication.

Now you can use this information to update the link with the master. Start a MySQL console and enter the following commands to do this, then start the slave:

mysql> CHANGE MASTER TO MASTER_HOST='...', MASTER_USER='...',
MASTER_PASSWORD='...', MASTER_LOG_FILE='mysql-bin.000228',
MASTER_LOG_POS=56875717;
mysql> START SLAVE;
mysql> SHOW SLAVE STATUS \G

Also in this case, you’ll have to enter the username and password you are using for the replication process, and be careful to enter the log file name and position correctly, according to your xtrabackup_binlog_info file.

Finally, if all worked as expected, you should see:

...
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
...
Seconds_Behind_Master: 2184
...

Cool, you’re done.

I am pretty sure that as soon as you try XtraBackup you'll see the huge benefits of this alternative over mysqldump. Make sure, though, that you test backups and restores thoroughly before using this system in production!

Know of any other tips related to XtraBackup, or of a similar or even better alternative (still free)? Please share them in the comments!

Smarter, faster backups and restores of MySQL databases using mysqldump – and other useful tips

Introduction

This is the first part of a mini series that will cover various topics related to MySQL administration, as well as several tips and tricks on using MySQL in our applications. It will be a way for me to finally reorganize some notes in a more readable format, and at the same time a good chance to share them, hoping you find them useful too! In this first part, I thought I'd put together a sort of “reference” on how to use mysqldump for backups, how to restore these dumps, and plenty of cool tricks in between.

If you work with MySQL at any level, chances are you may need to backup your databases occasionally or on a regular basis. MySQL is a great RDBMS: it performs very well in most scenarios, has got all you need for most applications, and it’s free (and hopefully will remain so), coming bundled with a number of useful tools and utilities, also free.

One thing about MySQL that has always disappointed me, though, is that while there is a choice of free tools for backing up databases, the tools that come bundled with the MySQL server for this purpose aren't really great. By far the most popular one among these is mysqldump, as it's very easy to use and performs backups by producing simple text files containing DROP/CREATE/INSERT SQL statements; once replayed on the destination MySQL instance, these statements will recreate all the databases, tables and data. mysqldump can also create dumps in other formats such as CSV and XML, but SQL is the most used format.

I am not a particularly passionate fan of mysqldump, as I prefer using alternative tools that are also free and that perform dramatically better than mysqldump in a number of cases. One of the main reasons why I don't like mysqldump too much, and try to avoid using it whenever I can, is performance. mysqldump works just fine with small databases, but it can be a terrible pain with large or very large databases, especially if you are so unlucky as to have to restore one! Pick any database with tables containing more than a few million rows and try restoring it. Unless you are very patient and not in a rush, it is likely to take longer than you think is reasonable, and perhaps you'll give up and look for something else.

However, mysqldump is still the only backup tool that most MySQL users know and use, and it's available wherever MySQL is. Also, at times a plain SQL-based backup may be useful, as it can usually be restored to a different version of MySQL without particular difficulties.

So, here are a few useful tips that can help you save some time when backing up MySQL databases with mysqldump and when restoring those dumps, plus some other nice tricks along the way.

First, the basics…

Backups, or “dumps”, are extremely easy to create with mysqldump. Here are a few examples:

# Backup a single database
# NOTE: you can also use the shortcuts -u and -p to specify username and password
mysqldump --user=... --password=... your_database_name > your_database_name.sql

# Backup multiple databases:
mysqldump --user=... --password=... --databases db1 db2 > backup.sql

# or all the databases in one shot:
mysqldump --user=... --password=... --all-databases > backup.sql

# Optionally, it's easy to compress the backup, for example with gzip:
mysqldump --user=... --password=... your_database_name \
| gzip > your_database_name.sql.gz

# And, if your database also has other objects (apart from tables) such as functions,
# views, and stored procedures, you can back them up too with --routines:

mysqldump --user=... --password=... --routines \
your_database_name > your_database_name.sql

# It may also be useful to include a timestamp in the target file name, so to know
# right away when a backup was taken:

mysqldump --user=... --password=... \
your_database_name > "your_database_name-$(date +%Y-%m-%d-%H.%M.%S).sql"

# If you are backing up data with the purpose of restoring it to an instance you want
# to use as replication slave, then the option --master-data is handy as it adds to
# the dump the information needed to configure the connection with the
# replication master:

mysqldump --user=... --password=... --all-databases --master-data > backup.sql

Restoring is pretty simple too. Usually it is done this way:

mysql -u.. -p... db_name < db_name.sql

Here we feed the dump to the mysql client, which imports and replays the SQL statements contained in our backup file. As we'll see later in this post, there are also smarter ways of backing up and restoring dumps created with mysqldump that work better in some cases.
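
If the dump was compressed with gzip, as in one of the examples above, you can decompress and restore in a single step along these lines:

gunzip < your_database_name.sql.gz | mysql -u... -p... your_database_name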

Speeding things up

Dumps

With the most recent versions of MySQL commonly in use these days, the single most useful option available with mysqldump is --opt. This option is actually a shortcut for a group of other options; among these there are a few that can help speed up backups and, more importantly, restores.

Update: the --opt option is now switched on by default in the latest versions of MySQL, so the following details are mostly for reference if you are on a recent version (and you should be!).

These options are:

--add-locks

Affects restores: this option surrounds each table's INSERT statements in the dump with LOCK TABLES and UNLOCK TABLES statements. Because a table remains locked to other transactions while its data is being restored, inserts happen more quickly, reducing the time taken to restore the content of the table.

--create-options

Affects restores: includes all the MySQL-specific table options in the CREATE TABLE statements, so that each table is recreated with exactly the same definition in a single statement.

--disable-keys

Affects restores: helps when restoring databases that use MyISAM as the storage engine. It delays the creation of a table's (non-unique) indexes until all the data in that table has been restored, which is faster overall than updating the indexes while the data is being inserted.

--extended-insert

Affects both dumps and restores: this can speed up restores A LOT, as it produces INSERT statements with multiple sets of values in the final SQL dump, resulting in multiple rows being inserted at once. As a side benefit versus having a separate INSERT statement for each row, the resulting SQL dump is also smaller, taking up less storage space.

--lock-tables

Affects dumps: improves dumping of MyISAM tables by locking all the tables during the dump.

--quick

Affects dumps: when dumping large tables, this option prevents mysqldump from buffering whole tables in memory before writing them to the backup file. Instead, rows are fetched and written to the file right away, resulting in an overall faster and lighter dump thanks to the reduced load on the system.

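Since --opt is enabled by default on recent versions, you can also switch off any individual option in this group with its --skip-* counterpart if needed; for example, to produce one INSERT statement per row instead of multi-row inserts:

mysqldump --user=... --password=... --skip-extended-insert your_database_name > your_database_name.sql
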
Restores

As I've just suggested, the --opt argument also helps with speeding up restores. However, there's another trick I use whenever I need to restore a dump, as it can save a lot of time: while a dump is being restored, I simply disable a number of checks that MySQL would otherwise perform to ensure the integrity of the data (foreign key checks and more), and then re-enable them as soon as the data has been completely restored.

One problem is that if the dump file I need to restore is pretty large, it's not a good idea to edit the content of the file directly to make these changes. So… cat to the rescue!

I basically use cat to produce a new stream containing the statements I just described plus the content of the original dump file, and then I pipe this stream directly to the target MySQL instance for the restore:

(
echo "SET AUTOCOMMIT=0;"
echo "SET UNIQUE_CHECKS=0;"
echo "SET FOREIGN_KEY_CHECKS=0;"
cat dump.sql
echo "SET FOREIGN_KEY_CHECKS=1;"
echo "SET UNIQUE_CHECKS=1;"
echo "SET AUTOCOMMIT=1;"
echo "COMMIT;"
) | mysql -u... -p... target_database

Believe me, I would never want to restore a large database without this trick! Another tip that I’d like to suggest is having a shell function ready to use so you won’t have to type this command whenever needed. For example:

MYSQL_USER="..."
MYSQL_PASSWORD="..."

function restore() {
echo $1;
(
echo "SET AUTOCOMMIT=0;"
echo "SET UNIQUE_CHECKS=0;"
echo "SET FOREIGN_KEY_CHECKS=0;"
cat "$1.sql"
echo "SET FOREIGN_KEY_CHECKS=1;"
echo "SET UNIQUE_CHECKS=1;"
echo "SET AUTOCOMMIT=1;"
echo "COMMIT;"
) | mysql -u"$MYSQL_USER" -p"$MYSQL_PASSWORD" "$1"
}

You can then use this function this way:

restore db_name

This expects a file named db_name.sql to exist in the current directory. The above is just an example; as we'll see later, you may want to add this shortcut to another shell function I suggest in the section “Other useful tips”.

Working with remote hosts

When you work with different environments (such as development, testing, staging, quality assurance, production, etc.), you most likely have to deal with different hosts, often located in different data centres, perhaps in different geographical regions and far away from each other. In these cases it may be difficult to transfer MySQL dumps from one environment to another and then restore them to different server instances, especially if the connection available between the two ends is slow, or if the database dumps, even compressed, are too large.

In such cases, instead of dumping databases to files, transferring those files to a different host over the Internet or another connection, and then restoring the data on the target host, it is possible to skip the intermediate step and restore directly to the target host while the data is still being dumped. This is called streaming, and it can be done in various ways.

Between hosts sharing the same network / between two MySQL instances on the same host / between two hosts with no firewall restrictions

The most straightforward scenario is when the source MySQL instance and the target instance share the same host, or at least the same network, or even live on different hosts and networks but are allowed to communicate directly with each other; that is, there are no firewall rules forbidding point-to-point communication. In these cases, you can dump data from one instance and restore it to the other while the dump is still in progress:

mysqldump -u... -p... --opt source_db \
| mysql -u... -p... --host=target_host target_db

With SSH

When direct communication between the hosts is forbidden, or when the data is sensitive and you need encryption, you can use SSH. In the following example, a database is being restored directly from a remote host to a local instance:

ssh user@remote_host mysqldump -u... -p... --opt source-db \
| mysql -u... -p... target_db

Similarly, if you want to restore a local database to a remote host:

mysqldump -u... -p... --opt source-db \
| ssh user@remote_host mysql -u... -p... target_db

If you, instead, only want to dump a remote database locally, or dump a local database to a remote file system:

# dumping remote => local
ssh user@remote_host 'mysqldump -u... -p... dbname \
| gzip' > backup-`date +%Y-%m-%d`.sql.gz

# dumping local => remote
mysqldump -u... -p... dbname | gzip \
| ssh user@remote_host "cat > backup-`date +%Y-%m-%d`.sql.gz"

With netcat

I wanted to mention a third option that uses netcat. It is perhaps not very popular, but I like it a lot and it works really well when there are no firewall restrictions between the source and target MySQL instances. I wouldn't use this technique when I need to transfer sensitive data, as it does not encrypt the traffic the way SSH does; whenever that is not an issue, though, transferring data this way can be a lot speedier, especially if I need to transfer really large databases!

Between SSH and netcat, it is perhaps more correct to talk about streaming when the tool in use is netcat. It requires two simple steps. First, we need to tell netcat to listen on a custom port, for example 4567, on the target host, and to save the data it will receive from the source host to a local file:

nc -l 4567 > backup.sql.gz

As you can see, I've appended the extension .sql.gz to the file name. This is because netcat will be receiving compressed data, as we have seen in other examples. Next, we execute mysqldump on the source host and stream its output, compressed, to the target host:

mysqldump -u... -p... --opt dbname | gzip | nc -w1 target_host_ip_or_hostname 4567

One cool thing is that netcat will automatically stop listening, and close the target dump file, as soon as the dump is completed and the stream is closed.

Other useful tips

Exporting only the schema, not the data

If you want to quickly create a database having the same schema as an existing database, you can dump only the schema, but not the data, from the existing one and then restore the dump as usual:

mysqldump -u... -p... --no-data dbname > dbname.sql

Dumping one or multiple databases and automatically deleting dumps older than X days

If you backup your database(s) regularly, for example with a cron job, you may want to keep an eye on the disk space used by backups. It is easy to automatically delete backups older than X days, as follows:

mysqldump ... && find /backup-location/* -mtime +15 -exec rm {} \;

In the example above, we dump the databases first (I have omitted arguments for mysqldump as these will depend on your backup needs), then we find and delete backups older than 15 days.
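
As a rough sketch, a crontab entry combining the two could look like this (schedule, credentials and paths are placeholders, and note that % must be escaped as \% in a crontab):

# every night at 02:30: dump everything, then prune backups older than 15 days
30 2 * * * mysqldump -u... -p... --all-databases | gzip > /backup-location/all-$(date +\%Y-\%m-\%d).sql.gz && find /backup-location/* -mtime +15 -exec rm {} \;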

Dumping multiple databases, but to separate files

As we saw in the second section, the --all-databases option lets you back up all the available databases. However, they will all be backed up to the same dump file, so sometimes you may prefer having a separate dump for each database. This can be useful if you need to restore one database but want to leave the others alone. First, we need to find out which databases are available, and perhaps exclude some of them from the backup. Say, for instance, that you want to back up all the databases except mysql and information_schema. First, we list all the databases with:

mysql -u... -p... -e "show databases;"

You should see something similar to this:

+-----------------------------+
| Database |
+-----------------------------+
| information_schema |
| database1 |
| database2 |
| mysql |
+-----------------------------+

We need to manipulate this output a little bit, so that we can extract from it just the database names, filtering out all the other characters that are not needed:

mysql -u... -p... -e "show databases;" | tr -d "| "

Which produces:

Database
information_schema
database1
database2
mysql

Almost there. We now want to remove the Database heading, and also exclude the databases mysql and information_schema, as per the example (unless you want to back them up too):

mysql -u... -p... -e "show databases;" \
| tr -d "| " \
| egrep -v "(Database|mysql|information_schema)"

Here we are:

database1
database2

We now have a clean list of the names of the databases we want to back up to separate files. All we have to do is chain this command with a simple for..do loop that will execute mysqldump for each database:

for db in \
`mysql -u... -p... -e "show databases;" \
| tr -d "| " \
| egrep -v "(Database|mysql|information_schema)"`; \
do mysqldump -u... -p... --opt --routines --databases $db > "$db.sql"; done

By executing the command above, you will create as many .sql dump files as the number of databases you have backed up.

Restoring from multiple dumps

If you dump each database to a different file, you can restore all these databases (should you need to restore all of them) this way:

for file in `ls *.sql`; do echo $file && mysql -u... -p... \
"`echo $file | sed "s/\.sql//"`" < $file; done

Note: some frameworks, such as Rails and, usually, others based on Ruby, encourage using a different database for each environment, for a number of reasons. Convention has it that these databases are named after the environment, for example: blog_development, blog_test and blog_production. Sometimes you may need to restore, say, production databases to a development environment, perhaps to test some new code on live data. How would you do this if you have backed up the production databases to separate files as described in the previous tip? In this case your dumps would be named like dbX_production.sql (or something similar), while your development databases, provided they already exist and you want to overwrite them with production data, would be named like dbX_development.

You can restore all the databases from the dumps with one command as follows:

for file in `ls *.sql`; do echo $file && sed "s/production/development/" $file \
| mysql -u... -p... "`echo $file | sed "s/production/development/" \
| sed "s/\.sql//"`" ; done

In the example above, we use sed to replace “production” with “development”. So, for example, the dump named db_production.sql would be restored to the database db_development without having to rename any files, which can be useful if you want to automate these commands somehow.

Shell shortcuts

As I often need these commands and others, I find it useful to create some bash aliases and functions (I usually use bash as my shell) to save some typing. I usually add these lines to my bash profile:

MYSQL_USER="..."
MYSQL_PASSWORD="..."
ENVIRONMENT="development"

mysql() {
case $1 in
# "command mysql" invokes the real client, so the function does not call itself recursively
alldbs ) command mysql -u"$MYSQL_USER" -p"$MYSQL_PASSWORD" -e "show databases;" \
| tr -d "| " | egrep -v "(Database|mysql|information_schema)" ;;
backup ) <commands seen in previous section> ;;
restore ) <commands seen in previous section> ;;
* ) command mysql -u"$MYSQL_USER" -p"$MYSQL_PASSWORD" "$@" ;;
esac
}

# usage:
# mysql alldbs => to list the available databases
# mysql backup => to dump all the databases
# and so on...

So I just add to that function as many shortcuts as I need for the commands I use most often. Of course, if you create such shortcuts in a production environment that can be accessed by others, or simply for better security, you may want to avoid putting your credentials in your bash profile.

…and One More Thing

In these pages I have shared with you several useful tips on mysqldump, but hold on! As promised in the introduction, this is only one part of a sort of mini series on MySQL, and certainly not the last post on backing up and restoring MySQL data.

As I’ve already mentioned, I only use mysqldump with small databases, while with large databases or with more complex setups (replication to name one), I prefer using other tools.

I will publish pretty soon another post describing these tools, and how to use them in those more complex setups where mysqldump, simply put, may not be enough.

So, stay tuned! In the meantime, if you find this sort of mini “reference” on mysqldump & co. useful, don’t forget to subscribe to my feed and to share with others!