Resque: automatically kill stuck workers and retry failed jobs

Resque is a great piece of software by Github that makes it really easy to perform some operations (‘jobs’) asynchronously and in a distributed way across any number of workers. It’s written in Ruby and backed by the uber cool Redis key-value data store, so it’s efficient and scalable. I’ve been using Resque in production for a couple of years now, after it replaced Delayed Job in my projects, and I love it. If your projects do something that could be done asynchronously, you really should check it out if you haven’t yet.
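
For those who haven’t used it, a Resque job is just a Ruby class with a queue name and a perform method, which you enqueue from your application code. A minimal sketch (the class name and arguments here are made up for illustration):

class PushNotificationJob
  # Workers watching the push_notifications queue will pick up these jobs
  @queue = :push_notifications

  def self.perform(device_id, message)
    # the actual (potentially slow) work happens here
  end
end

# Somewhere in the application:
Resque.enqueue(PushNotificationJob, 42, "Hello!")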

At OnApp we’ve been using Resque for a while to process background jobs of various types, with great results: in a few months we’ve processed a little over 160 million jobs (at the time of this writing), and out of all those only about 43K jobs have been counted as failed so far. However, many of these failed jobs were retried successfully on a subsequent attempt, so the number of jobs that actually failed is a lot smaller, perhaps a few thousand.

Out of 160M+ jobs, that’s a very small percentage of failures. But even though the system has been rock solid for the most part, jobs can still fail every now and then, depending on the nature of the jobs, excessive load on the worker servers, temporary networking and timeout issues, or design-related issues such as race conditions and the like. Sometimes you will also find that workers can get “stuck”, usually requiring manual intervention (as in: kill/restart the workers, manually sort out failed jobs).

So I wanted to share a simple script I am using in production to automatically find and kill these “stuck” workers, and then retry any jobs that are flagged as failed, whether because their workers were killed or for some other reason. The purpose is to keep the workers running and minimise the need for manual intervention when something goes wrong.

Please note that I use resque-pool to manage a pool of workers more efficiently on each worker server. Therefore if you manage your workers in a different way, you may need to adapt the script to your configuration.
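
For reference, resque-pool reads the queues to watch and the number of workers for each from a simple YAML file (config/resque-pool.yml in a Rails app). The queue names and worker counts below are just an example, loosely based on the queues you’ll see in the output further down:

production:
  cdn_transactions_collection: 1
  usage_data_collection: 2
  check_client_balance: 2
  geo_location: 1
  services_coordination: 1
  push_notifications: 2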

You can find the little script in this gist, but I’ll briefly explain here how it works. It’s very simple, really. First, the script looks for the processes that are actually working off jobs:

root@worker1:/scripts# ps -eo pid,command | grep [r]esque
10088 resque-pool-master: managing [10097, 10100, 10107, 10113, 10117, 10123, 10138, 10160, 10167, 10182, 10195]
10097 resque-1.20.0: Forked 16097 at 1337878130
10100 resque-1.20.0: Forked 16154 at 1337878131
10107 resque-1.20.0: Waiting for cdn_transactions_collection
10113 resque-1.20.0: Waiting for usage_data_collection
10117 resque-1.20.0: Waiting for usage_data_collection
10123 resque-1.20.0: Waiting for check_client_balance
10138 resque-1.20.0: Waiting for check_client_balance
10160 resque-1.20.0: Waiting for geo_location
10167 resque-1.20.0: Forked 16160 at 1337878131
10182 resque-1.20.0: Forked 16163 at 1337878132
10195 resque-1.20.0: Waiting for services_coordination
16097 resque-1.20.0: Processing push_notifications since 1337878130
16163 resque-1.20.0: Processing push_notifications since 1337878132

This is an example from one of our worker servers. The Processing processes are those that are actually working off jobs, so they are the ones we are after, since they are the processes that can sometimes get “stuck” for one reason or another. So the script first looks for these processes only, ignoring the rest:

root@worker1:/scripts# ps -eo pid,command | grep [r]esque | grep Processing
18956 resque-1.20.0: Processing push_notifications since 1337878334
19034 resque-1.20.0: Processing push_notifications since 1337878337
19052 resque-1.20.0: Processing usage_data_collection since 1337878338
19061 resque-1.20.0: Processing usage_data_collection since 1337878338
19064 resque-1.20.0: Processing usage_data_collection since 1337878339
19066 resque-1.20.0: Processing usage_data_collection since 1337878339

Next, the script loops through these processes, and looks for those that have been running for over 50 seconds. You may want to change this threshold, but in our case all jobs should usually complete in a few seconds, so if some jobs are still found after almost a minute, something is definitely going on.

ps -eo pid,command |
grep [r]esque |
grep "Processing" |
while read PID COMMAND; do
  if [[ -d /proc/$PID ]]; then
    # Age of the process in seconds: system uptime minus the process start time
    # (field 22 of /proc/PID/stat, in clock ticks - assuming 100 ticks per second)
    SECONDS=`expr $(awk -F. '{print $1}' /proc/uptime) - $(expr $(awk '{print $22}' /proc/${PID}/stat) / 100)`

    if [ $SECONDS -gt 50 ]; then
      kill -9 $PID
      ...

      QUEUE=`echo "$COMMAND" | cut -d ' ' -f 3`

      echo "
      The forked child with pid #$PID (queue: $QUEUE) was found stuck for longer than 50 seconds.
      It has now been killed and job(s) flagged as failed as a result have been re-enqueued.

      You may still want to check the Resque Web UI and the status of the workers for problems.
      " | mail -s "Killed stuck Resque job on $(hostname) PID $PID" [email protected]

      ...
    fi
  fi
done

I was looking for a nice and easy way to find out how long (in seconds) a process had been running, and the expression you see in the code snippet above was the nicest solution I could find.
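
Broken down for a single process, the same computation looks like this (it assumes the usual 100 clock ticks per second; the PID is of course just an example taken from the output above):

PID=19052
UPTIME=$(awk -F. '{print $1}' /proc/uptime)                  # seconds since boot
STARTED=$(expr $(awk '{print $22}' /proc/$PID/stat) / 100)   # process start time, in seconds since boot
echo $(expr $UPTIME - $STARTED)                              # how long the process has been running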

If any of the Resque processes that are working off jobs are found running for longer than 50 seconds, then these are killed without mercy and a notification is sent to some email address just in case.

First, this way we don’t actually kill the Resque workers themselves, but the child processes forked by the workers in order to process jobs. This means that the workers remain up and running, and soon after they’ll fork new processes to work off other jobs from the queue(s) they are watching. This is the nicest part, in that you don’t need to manually kill the actual workers and then restart them to keep the worker servers going.

Second, killing those processes will cause the jobs that they were processing to fail, so they will appear in Resque’s “failed jobs” queue. The second part of the script takes care of this by running a rake task that re-enqueues all failed jobs and clears the failed jobs queue. For starters, you’ll need to add this rake task to your application. If you are already using Resque, you will likely have a lib/tasks/resque.rake file, otherwise you’ll have to create one (I’m assuming here it’s a Rails application).

In any case, add the following task to that rake file:

desc "Retries the failed jobs and clears the current failed jobs queue at the same time"
  task "resque:retry-failed-jobs" => :environment do
    (Resque::Failure.count-1).downto(0).each { |i| Resque::Failure.requeue(i) }; Resque::Failure.clear
  end
end

Back to the script: if it finds and kills any stuck processes, it then proceeds to run the above rake task so as to retry the failed jobs:

ps -eo pid,command |
grep [r]esque |
grep "Processing" |
while read PID COMMAND; do
  if [[ -d /proc/$PID ]]; then
    SECONDS=`expr $(awk -F. '{print $1}' /proc/uptime) - $(expr $(awk '{print $22}' /proc/${PID}/stat) / 100)`

    if [ $SECONDS -gt 50 ]; then
      ...
      touch /tmp/retry-failed-resque-jobs
      ...
    fi
  fi
done

if [[ -f /tmp/retry-failed-resque-jobs ]]; then
  /bin/bash -c 'export rvm_path=/usr/local/rvm && export HOME=/home/deploy && . $rvm_path/scripts/rvm && cd /var/www/sites/dashboard/current/ && /usr/local/bin/rvm rvmrc load && RAILS_ENV=production bundle exec rake resque:retry-failed-jobs'
fi

You may notice that I am forcing the loading of RVM before running the rake task; this is because I need to upgrade some stuff on the worker servers, but you may not need to run the rake task this way.
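
If you don’t need to load RVM explicitly, something much simpler along these lines should do (the application path is of course specific to my setup):

cd /var/www/sites/dashboard/current && RAILS_ENV=production bundle exec rake resque:retry-failed-jobs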

This is basically it: the script kills the stuck workers and retries the failed jobs without requiring manual intervention; in almost all cases I no longer have to worry about them, other than wondering whether there’s a design issue that might cause workers to get stuck and that therefore needs to be addressed (which is a good reason to keep an eye on the notifications). There may be other monitoring solutions of various types out there, but this simple script is what has been working best for me so far on multiple worker servers with tens of workers.

The final step is to ensure that this script runs frequently, so as to fix problems as soon as they arise. The script is extremely lightweight, so in my case I just schedule it (with cron) to run every minute on each server.
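
For example, assuming the script is saved as /scripts/kill-stuck-resque-workers.sh (the name and location are up to you), the crontab entry could simply be:

* * * * * /bin/bash /scripts/kill-stuck-resque-workers.sh > /dev/null 2>&1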

Know of a better way of achieving the same result? Please do let me know in the comments.
