Syslog woes

About a week ago I was helping some colleagues, from our Ukrainian team, investigate a weird issue they found with a client, following the deployment of a new version of our cloud management software. The problem was that soon after the update the application had started to show a serious degradation of its performance, to the point that even the light load we were seeing at that time of the day couldn’t be handled, despite the beefy boxes and the resources available with that client’s particular setup.

At first, we couldn’t find any evident reason for the issue, since the update brought new features and several fixes but there wasn’t anything, really, that could be linked right away to a so severe degradation of the performance. Plus, as usual the update had undergone some extensive QA for days, before the deployment, so we initially thought it could be something related to that client’s particular setup, once we had ruled out other possible causes in the code.

After quite a bit of investigation, the root cause of the issue turned out to be syslog.

Our cloud management software consists of a Rails 3 web application which is used as a front end to the virtualisation layer, plus a number of separate Ruby daemons and workers responsible for various tasks. So there are several log files that need to be rotated and in some cases copied to some other location to ease support related tasks.

Due to its particular nature, our cloud management software currently has to be self-hosted by our clients, so the team was looking into ways to simplify the setup/update process also by removing some configuration, including -among other things- the custom logrotate configs that had been used until then.

Therefore, they had recently decided to move all the logging to syslog, so to benefit from log rotation without additional logrotate configuration, and features such as the replication/distribution of logs. I do recall a conversation during a recent work trip in which I expressed negative opinion on using syslog for a web application, due to some issues I experienced in the past, but the change made its way into the code and, unfortunately, it led to exactly the same performance issues I had already experienced before.

What is syslog?

I don’t have much experience myself with using syslog for my applications because of the aforementioned performance issues I’ve seen when using it for a couple of past projects, so I haven’t really tried it anymore ever since. So the only “contact” I usually have with syslog is through the logs that are typically configured by various daemons and system processes to be managed with syslog.

However, despite syslog has been around for almost 30 years, one thing that I have noticed often when talking on the subject with others or reading forums and articles, is that there often seems to be some confusion about what syslog is. Often syslog is talked about as a utility, or a library, or an operating system’s component, but syslog is actually not a tool, but a protocol.

In fact, it is a standardised protocol defined in RFC5424.

However, since there are no trademarks or similar obstacles preventing it, the term “syslog” (or others containing it) is often used to name tools using this protocol. There are various implementations of both the syslog protocol and of tools using it, and while most people use free implementations that come with various Unix flavours, there are also some commercial ones which may offer a richer feature set or improved performance.

The “default” implementation on many Linux distributions, for example, is syslogd.

Why you may want to use syslog

Using syslog can have its advantages over the standard logging usually used with Rails applications (or anything that handles its own logging to files, for that matter), including:

  • (as mentioned) it handles log rotation without any additional configuration;
  • it supports replication and distribution of logs to remote systems;
  • makes the separation of concerns between an application’s code and any configuration or setup around it easier;
  • most sysadmins likely prefer having all logging in a single location, to log files scattered through the filesystem;
  • there’s a lot of applications and frameworks with built in syslog support;
  • comes with a command line utility (logger) that makes integrating syslog a snap even with scripts and programs that do not support it natively;
  • recent implementations have support for named pipes, which make it possible for processes to talk to each other and, for example, to react to messages logged to some syslog facility. An example could be some automated notification of logged messages;
  • normal logging to file may bring a system to a halt if disk space runs out, while most implementations of syslog will simply stop logging when this happens;
  • implementations usually come with utilities to compress, backup or remove log files;
  • logs tend to be stored almost always in the same location across systems, so almost always sysadmins know where to find logs right away;
  • syslog makes it easy to add useful information such as the pid and the program name to logged information;
  • some implementations make it easy to search logs depending on the content, with powerful filters;
  • “portability”: some implementation of syslog is available on basically every Unix flavour, and even on other operating systems such as Windows;
  • in production, depending (in particular) on the application server in use and how the load is distributed across instances of your app, you can either log to a separate file for each instance, or you can have all the instances log to the same file, which is likely what happens in most cases. With such configuration, however, there’s a good chance that your logged lines will get interleaved and be basically unreadable or difficult to understand, and therefore, such logs won’t be of much help when investigating issues. There are various logging options that help avoid such issues, with syslog being one of them. You don’t have to use syslog just to avoid it though (I think, for example, that ActiveSupport’s BufferedLogger also solves it), but the possibility to have your logs stored on some remote machine (perhaps a central log server), may be welcome in many scenarios.
  • (depending on how it is used) it can be considered a sort of “thread safe” logging solution, although what makes this possible is also the reason why syslog may be a performance killer in some scenarios.

When you should not use syslog

While syslog provides you with a complete and reliable logging solution, it isn’t the perfect tool for logging needs in every scenario.

With web applications, multithreaded code and any other application that may see high concurrency, it can significantly affect the performance. I am talking here about the “stock” implementation of the syslog daemon (syslogd) available with most Unix flavours, since some implementations (especially a few commercial ones) may perform better in such cases.

Don’t believe me? Try it for yourself. Provided that you simply use syslog as a replacement for the normal Rails logging – not with some dedicated, high performance logging setup – run some benchmarks both with syslog and with the standard logging, and see the difference. You should try a comparison in production mode, and possibly with some decent concurrency to simulate a high traffic application.

In our case, the application had become almost useless in production, with some decent load – which is likely the reason the issue hadn’t been spotted during QA; syslog was perhaps already heavily used by other processes, making it all the worse.

The application, however, “came back to life” once syslog was disabled and the normal logging restored, with no additional changes.

with syslog

Concurrency Level: 5
Time taken for tests: 6.263053 seconds
Complete requests: 10
Failed requests: 0
Write errors: 0
Non-2xx responses: 10
Total transferred: 5130 bytes
HTML transferred: 940 bytes
Requests per second: 1.60 [#/sec] (mean)
Time per request: 3131.527 [ms] (mean)
Time per request: 626.305 [ms] (mean, across all concurrent requests)
Transfer rate: 0.80 [Kbytes/sec] received

with normal logging

Concurrency Level: 5
Time taken for tests: 0.373141 seconds
Complete requests: 10
Failed requests: 0
Write errors: 0
Non-2xx responses: 10
Total transferred: 5130 bytes
HTML transferred: 940 bytes
Requests per second: 26.80 [#/sec] (mean)
Time per request: 186.570 [ms] (mean)
Time per request: 37.314 [ms] (mean, across all concurrent requests)
Transfer rate: 13.40 [Kbytes/sec] received

I ran these simple benchmarks on a single test instance of the application without additional configuration such as caching or load balancing with multiple instances, but these figures should give you an idea of what sort of impact syslog had on our application.

Why such a performance degradation?

First, let me point out -again- that in this particular case, the reason why syslog had been chosen was simply to reduce the configuration and setup overhead with deployments and setups of new clients, since -as mentioned- our cloud management software has to be hosted by our clients. So syslog had only been used purely as replacement for the standard logging we had used until then; we didn’t have a particularly optimised syslog setup with one or more servers dedicated to the collection of logs. That would have likely been a different case, also considering that we could also have used one of those syslog implementations (both free and commercial) that make use of buffering (for example rsyslog with the relp protocol) and other techniques to improve performance while also reducing message loss both when logging to a local disk, and when logging to a remote syslog server. In our case, the standard syslogd was used to log to the local filesystem.

The reason why syslog affected the performance of our application is now quite obvious: by default syslogd will wait for logged messages to be completely written to disk before returning. This kinda serialises the writes to the log for reliability, so to prevent the loss of some messages in the case of a crash. At the same time, this behaviour helps avoid the “interleaved log lines” mentioned earlier.

This happens because the usual syslogd implementation uses fsync() to write a message to a log. This will cause syslog to “commit” each message to disk, which can slow down an application considerably if a lot of messages need to be logged (think of a web application with some decent load and concurrency).

In theory, it it possible to circumvent this issue by using some other implementation of syslog, such as syslog-ng, for example, since it has an option that allows you to specify how many messages to queue up before these messages get actually written to disk. In practice, however, it seems like each of these messages is written to disk with a separate write() call, so it is not clear of the performance benefits of the queue size option.

So a better option perhaps is to disable fsync altogether, as suggested in an article by IBM, provided you can sacrifice some reliability. In many cases, it is possible to disable fsync with the standard syslogd by just prepending a “-” to the log’s filename, e.g.:

# Log anything (except mail) of level info or higher.
# Don't log private authentication messages!
*.info;mail.none;authpriv.none -/var/log/messages
# The authpriv file has restricted access.
authpriv.* -/var/log/secure
# Log all the mail messages in one place.
mail.* -/var/log/maillog

While this may reduce reliability, it may even save your hard drives! I’ve often read stories in the past of people who have literally killed their hard drives by just using syslog with fsync for some heavy logging!
If your apps have some heavy logging to do, and you can’t afford some more expensive setup, then you really should forget about syslog, because normally it doesn’t scale well if used with high traffic web applications. It’s best used for things like logging done by daemons and system components that do not suffer from the load that a web app with decent traffic can have.

There is another reason why you may want to think twice before using syslog for your Rails apps’ logging: message packets are limited to 1024 bytes. This may often be a problem, since with a web application it is not too difficult to have the app log messages larger than 1K, even in production, and this means that those messages would have to be fragmented to be logged properly, further worsening the performance.

The bottom line of this post is, as you may have guessed: think twice before using syslog with web apps!