Resque: automatically kill stuck workers and retry failed jobs

resque-automatically-kill-stuck-workers-retry-failed-jobs

Resque is a great piece of software by Github that makes it really easy to perform some operations (‘jobs’) asynchronously and in a distributed way across any number of workers. It’s written in Ruby and backed by the uber cool Redis key-value data store, so it’s efficient and scalable. I’ve been using Resque in production for a couple years now after it replaced Delayed Job in my projects, and love it. If your projects do something that could be done asynchronously, your really should check it out if you haven’t yet.

At OnApp we’ve been using Resque for a while to process background jobs of various types, with great results: in a few months, we’ve processed a little over 160 million jobs (at the moment of this writing), and out of this many only 43K jobs have been counted as failed so far. However, many of these failed jobs have been retried successfully at a successive attempt, so the number of jobs that actually failed is a lot smaller, perhaps a very few thousands.

Out of 160M+ jobs, it’s a very small percentage of failures. But despite the system, for the most part, has been rock solid so far, jobs can still fail every now and then depending on the nature of the jobs, excessive load on the worker servers, temporary networking and timeout issues or design related issues such as race conditions and alike. Sometimes, you will also find that workers can get “stuck”, requiring (usually) manually intervention (as in: kill / restart the workers, manually sort out failed jobs).

So I wanted to share a simple script I am using in production to automatically find and kill these “stuck” workers and then retry any jobs that are found as ‘failed’ due to the workers having been killed, or else. The purpose is to keep workers running and minimise the need for manual intervention when something goes wrong.

Please note that I use resque-pool to manage a pool of workers more efficiently on each worker server. Therefore if you manage your workers in a different way, you may need to adapt the script to your configuration.

You can find the little script in this gist, but I’ll briefly explain here how it works. It’s very simple, really. First, the script looks for the processes that are actually working off jobs:

This is an example from one of our worker servers. The Processing processes are those that are actually working off jobs, so these are the ones we are after since these are the processes that can get “stuck” sometimes for a reason or another. So the script first looks for these processes only, ignoring the rest:

Next, the script loops through these processes, and looks for those that have been running for over 50 seconds. You may want to change this threshold, but in our case all jobs should usually complete in a few seconds, so if some jobs are still found after almost a minute, something is definitely going on.

I was looking for a nice and easy way to find out how long (in seconds) a process had been running, and the expression you see in the code snippet above was the nicest solution I could find (hat tip to joseph for this).

If any of the Resque processes that are working off jobs are found running for longer than 50 seconds, then these are killed without mercy and a notification is sent to some email address just in case.

First, this way we don’t actually kill Resque workers, but other processes forked by the workers in order to process jobs. This means that the workers remain up and running and soon after they’ll fork new processes to work off some other jobs from the queue(s) they are watching. This is the nicest part, in that you don’t need to manually kill the actual workers and then restart them in order to keep the worker servers going.

Second, killing those processes will cause the jobs that they were processing to fail, so they will appear in Resque’s “failed jobs” queue. The second part of the script takes care of this by running a rake task that re-enqueues all failed jobs and clears the failed jobs queue. For starters, you’ll need to add this rake task to your application. If you are already using Resque, you will likely have a lib/tasks/resque.rake file, otherwise you’ll have to create one (I’m assuming here it’s a Rails application).

In any case, add the following task to that rake file:

Back to the script, if it finds and kills any workers that it found stuck, it then proceeds to run the above rake task so to retry the failed jobs:

You may notice that I am forcing the loading of RVM before running the rake task; this is because I need to upgrade some stuff on the worker servers, but you may not need to run the rake task this way.

This is basically it: the script just kills the stuck workers and retries the failed jobs without requiring manual intervention; in almost all cases, I don’t have to worry anymore about them besides wondering whether there’s a design issue that might cause workers to get stuck and that therefore need to be addressed (which is a good reason to keep an eye on the notifications). There might be other monitoring solutions of various types out there, but this simple script is what has been working best for me so far on multiple worker servers with tens of workers.

The final step is to ensure that this script runs frequently so to fix problems as soon as they arise. The script is extremely lightweight, so in my case I just schedule it (with cron) to run every minute on each server.

Know of a better way of achieving the same result? Please do let me know in the comments.




Have your say!

Please see my comment policy if this is your first time here or if you have any questions regarding comments.