Thanks Google for being so (un)helpful

I tend to be a big fan of Google: I use many of their services all the time, I love the innovation they endlessly bring to the Web and to technology in general, and -while I could never quite take the “Don’t be evil” line seriously- “the Google story” has to be one of my favourite stories of small, garage-born startups becoming extremely successful companies.

Among other things, what has always impressed me about their services is their reliability, together with excellent performance, a minimalistic but functional design, and great integration across their products. Lately, however, I have been unpleasantly surprised by a few things that got me wondering what’s going on with Google, and whether something has changed.

First, there’s Gmail. Once upon a time I used the web interface a lot, especially since I moved all my email accounts and domains over to Google Apps a few years back and started using the Docs & Calendar apps as well; it was innovative, with its threaded view and labels in place of folders, making it easier and less overwhelming to process the inbox and organise messages in a more meaningful way. It already had the best spam filters around and, coming from Google, it naturally had the best search experience ever: you could just archive all your mail and perhaps not even bother organising it, knowing that you could find any message quickly with just a few relevant keywords. And then, thanks to its new Ajax-style UI, it was fast! Very fast, faster than any other mail service I had tried before.

Over the past year I have been using a desktop client rather than the web interface to access my email accounts, so I haven’t really been up to date on Gmail’s development and new features. That was until I read a few blog posts on the new Priority Inbox feature and decided I had to give it a try. With my desktop client of choice (Apple Mail) I was tired of having to re-scan my inbox often and manually filter out useless crap to be left with the few important messages worth reading, plus other messages I wanted to read later (hence the need to frequently re-scan the inbox). I am now back to using the web interface full time, and I am loving Priority Inbox combined with the third-party add-on OtherInbox, so it looks like my inbox is finally behaving as it should.

I can’t say I am loving the performance, though! I can see that Gmail has grown up a lot in the meantime, and it now has lots of interesting Labs features, but while the spam filters and the search experience are still great, the whole application has become a lot slower than it used to be. How come? It looks like I am not alone on this: the web is full of users complaining about the performance of the service, with some saying Gmail has become unusable for them, and others even suggesting reverting to the basic HTML mode instead… which means giving up all but the essential features of a web-based email client. A quick search also shows that Google have been aware of this issue for quite some time now, but while they promised to be on it and that it would be fixed soon, it doesn’t look to me like Gmail is anywhere near as fast as it used to be.

Gmail’s performance issues aren’t the only thing that has been disappointing me lately. A few days ago I finally took the time to find out how to migrate to my main Google Apps account all the data associated with the various Google services (such as Analytics and Feedburner) that were previously unavailable to Google Apps accounts, which is why until now I have had to keep a separate “regular” Google account for these services alone. I was happy to learn that I could now merge these accounts, as it kind of sucked having to keep separate accounts with the same company, but … OMG! Google couldn’t have made it more difficult! Not only can some services not be migrated from a regular Google account to a Google Apps account at all; almost every service that can be migrated requires a different procedure! This is hardly the great integration I mentioned above. The way Google has been handling these different types of accounts for this long is IMHO just plain ridiculous. As if this were not enough, it took me a while to find any relevant information on the matter in the Google Apps help pages, which in most cases are unbelievably unhelpful.

The cherry on the cake: earlier today I was trying to access the Analytics profiles I had migrated the day before from my regular Google account to my main Google Apps account. Everything worked fine all of yesterday, but today it decided to greet me with this:

(image no longer available)

Of course, there was no additional information or helpful explanation of what caused the error, or of what steps I could take to fix it, if there is anything I can do at all. Looks so much like Microsoft, doesn’t it?

After a quick search, the only mention I could find of a possible fix was on this page, in which a Google employee suggests visiting another page (LOL, a Google Docs spreadsheet), where I am supposed to enter my email address, with the promise that the issue will be fixed within 24 hours. It smells weird, but the user is listed as a “Google employee” on the help pages, so -hopefully- it shouldn’t be anything spammy.

I haven’t tried it yet -I will in a minute- but regardless of whether or not this actually fixes the problem, I still haven’t found the reason behind it (did something go wrong during the migration? If so, why did Analytics work fine yesterday?) and, generally speaking, Google’s help pages and support are surprisingly useless or, at best, pretty difficult to navigate when looking for the relevant information.

Google may be acquiring tons of companies every year, but I don’t think they are going in the right direction with regard to product integration and customer support. C’mon, Google, fix it!

Update: something really is going wrong at Google! Not only has the issue with the Analytics profiles not been fixed yet; tonight I went into my Feedburner account to change some settings and my feed was gone! The weird thing is, the feed itself still seems to be working, but the “My feeds” list in Feedburner is empty. I am pretty sure this also has to do with the migration – I transferred the feed settings over to my Google Apps account as well.

Arghhhh!!!!

Using Google Translate from the terminal

(Update Jan 14, 2011: If you have already used this tip and are back to this post because it’s no longer working, read on… Google have updated the Translate API v2, so I have made some changes to the end of the post accordingly.)

In a previous post, I showed a nice shortcut I use quite often to search for definitions, with Google‘s define search feature, from the command line rather than from within a browser.

As I always have a few terminal windows open at any time, I often look for ways of using popular web services from the command line, as this can be fun and save time too; nowadays, many web services that have a UI meant to be consumed with a browser also expose APIs that allow developers to integrate them into their own applications or web mashups.

Google, in particular, offer APIs for just about every one of their web services, including the very popular Google Translate, which I use a lot to translate mostly between English, Finnish and Italian. So, how can we use Google Translate from the command line?

In the previous example, we saw how easy it is to fetch any kind of web page from the command line using utilities such as wget; we also saw how to manipulate and format the returned content, for example to adapt it for display in a terminal. We could do something similar with Google Translate, but there is a quicker and better way to achieve the same result: the Google Translate API. This API is consumed with ordinary HTTP requests, but it returns a JSON response rather than a normal HTML web page, as it is designed to be integrated by developers into other applications and services.

We could manipulate this JSON response once again with filters based on regular expressions, as seen in the previous example, but there is an easier way of parsing the JSON directly from the command line: a utility called jsawk, which works much like awk but operates specifically on JSON-formatted text.
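
If you want to make sure jsawk is installed and behaves as expected before going further, a quick sanity check on a made-up JSON object (this just mirrors the way we will query the API’s response later in this post) could be:

echo '{"greeting":{"text":"hello there"}}' | jsawk "return this.greeting.text"
=> hello there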

The first step, naturally, is to make a request to the Google Translate API; the documentation shows that the API can be consumed in two ways: either with some JavaScript code, or by making a REST request directly to the service and getting back the result in JSON format, almost ready to use – which is what we are going to do.

First of all, you’ll need a Google Account -if you don’t have one already, create one- and you’ll also need to request an API key to be able to consume the service; you can request a key for free here.
The documentation tells us that we should issue requests with URLs in the following format:

https://www.googleapis.com/language/translate/v2?key=YOUR KEY&q=TEXT TO TRANSLATE&source=SOURCE LANGUAGE CODE&target=TARGET LANGUAGE CODE

So, for example, to translate the word “hello” from English to French, we can issue a request with wget like this:

wget -qO- --user-agent firefox "https://www.googleapis.com/language/translate/v2?key=YOUR KEY&q=hello&source=en&target=fr"

As in the previous example, we need to specify a user agent, otherwise Google will return an empty response. You should see this result:

{
  "data": {
    "translations": [
      {
        "translatedText": "bonjour"
      }
    ]
  }
}

which is a JSON response, as expected. The next step is to parse this response and get the value of the translatedText property in the nested object data->translations.

However, jsawk seems to expect a JSON array of objects, so we’ll first need to manipulate this response and wrap it in [ ] brackets to obtain an array with a single item:

echo "[`wget -qO- --user-agent firefox \"https://www.googleapis.com/language/translate/v2?key=YOUR KEY&q=hello&source=en&target=fr\"`]"

the JSON response becomes:

[ {
  "data": {
    "translations": [
      {
        "translatedText": "bonjour"
      }
    ]
  }
} ]

and it is now ready to be parsed by jsawk. We need to get the value of the property data.translations[0].translatedText of the first item of the array (as you can see, data.translations is itself an array):

echo "[`wget -qO- --user-agent firefox \"https://www.googleapis.com/language/translate/v2?key=YOUR KEY&q=hello&source=en&target=fr\"`]"  \
| jsawk -a "return this[0].data.translations[0].translatedText"

You should now see just the word “bonjour” rather than the full JSON response. One last step, which we already saw in the previous example with the define search feature, is to make sure any HTML entities in the translated text are displayed correctly in the terminal:

echo "[`wget -qO- --user-agent firefox \"https://www.googleapis.com/language/translate/v2?key=YOUR KEY&q=hello&source=en&target=fr\"`]" \
| jsawk -a "return this[0].data.translations[0].translatedText" \
| perl -MHTML::Entities -pe 'decode_entities($_)'

At this point, you can automate this command by wrapping it within a shell function that accepts as arguments the source language ($1), the target language ($2) and the text you want to translate ($3):

translate() {
    echo "[`wget -qO- --user-agent firefox \"https://www.googleapis.com/language/translate/v2?key=YOUR KEY&q=$3&source=$1&target=$2\"`]" \
    | jsawk -a "return this[0].data.translations[0].translatedText" \
    | perl -MHTML::Entities -pe 'decode_entities($_)'
}

so you can use this function like this:

translate en fr "hello"
=> bonjour

You may also set up some aliases for the pairs of languages you translate between the most:

alias en2fr='translate en fr "$@"'

So the example for English->French translation simply becomes:

en2fr "hello"
=> bonjour

As seen in this last example, remember to wrap the text to translate within double quotes if you are translating a phrase rather than a single word.

Now, to test that everything works, guess what “Hyvää Joulua kaikille” means in Finnish. 🙂

Update Jan 14, 2011: I noticed today that the trick I described in this post was no longer working as is; it looks like Google have updated version 2 of the Translate API, which is the version I used in the commands above. I hadn’t actually noticed that this version was still a “Labs” version and not yet a final release, as highlighted in the documentation:

Important: This version of the Google Translate API is in Labs, and its features might change unexpectedly until it graduates.

Funnily enough, the documentation itself doesn’t yet reflect some changes they’ve already made to the API. In particular, a request made to the same URL,

wget -qO- --user-agent firefox "https://www.googleapis.com/language/translate/v2?key=YOUR KEY&q=hello&source=en&target=fr"

now yields a JSON response in a slightly different format:

{
 "data": {
  "translations": [
   {
    "translated_text": "bonjour"
   }
  ]
 }
}

As you can see, the array’s gone and they’ve renamed the translatedText property to translated_text. So the new version of the translate function, still using the Google Translate API v2, would be:

translate() {
    wget -qO- --user-agent firefox "https://www.googleapis.com/language/translate/v2?key=YOUR KEY&q=$3&source=$1&target=$2" \
    | jsawk  "return this.data.translations[0].translated_text" \
    | perl -MHTML::Entities -pe 'decode_entities($_)'
}

which is also a little bit simpler. However, since they’ve made it clear that the API may still change while it is in Labs, it’s perhaps more convenient to stick with the Google Translate API v1 in the meantime – the result, in the end, should be the same. So the translate function, using v1 instead according to its documentation, becomes:

translate() {
    wget -qO- --user-agent firefox "https://ajax.googleapis.com/ajax/services/language/translate?v=1.0&q=$3&langpair=$1|$2" \
    | jsawk "return this.responseData.translatedText" \
    | perl -MHTML::Entities -pe 'decode_entities($_)'
}

It doesn’t look to me like this version requires a key, as it seems to work just fine without one. Quick test:

en2fr 'Thanks, Google!'
=> Merci, Google!

Protect your WordPress blog’s administration from prying eyes

If you also have a blog powered by WordPress, and care at least a bit about keeping it secure, you may have thought about preventing others from accessing its administration pages.

The reason, of course, is that just about every WordPress user knows the default location of both its administration area (the famous wp-admin folder) and its login page, so it is likely that sooner or later someone will try to gain unauthorised access to your blog, in one way or another, and have some fun. WordPress, historically, has never been too solid from a security point of view; moreover, out of the box it does not lock out a user or IP address after a reasonable number of failed login attempts, as it should. So unless you have a very strong password, it may not take too long for a malicious, patient visitor to figure it out (especially since Amazon announced the availability of reasonably priced Cluster GPU instances on their EC2 platform…)

So, unless you really have to let your visitors register and log in on your blog, it is advisable to keep everyone else away from the login page altogether. This may or may not be possible, depending on why your site needs users to log in locally to perform some actions, but -luckily- there are almost always workarounds. In most cases, in fact, this only has to do with enabling your users to leave comments on the site. If this is true for you too, I recommend you think about “outsourcing” the management of your comments to third-party services, the most popular ones being IntenseDebate (by WordPress.com parent company Automattic) and, in particular, Disqus – which also happens to be my favourite these days, and is therefore the one I recommend.

(Note: if just the thought of having your blog’s comments stored elsewhere, by someone else, makes you shiver, trust me: there is nothing to worry about, and only things to gain. First, most of these services are free -including the two aforementioned popular ones- so it won’t cost you a penny. Second, they integrate pretty well with WordPress, in such a way that your comments are actually synchronised between your blog and the remote service, so you always have a local copy of all your blog’s comments. Third, by outsourcing your comments you will relieve the load on your own servers quite a bit and improve the cacheability of your site: if your blog caches its dynamic pages to static HTML files that can be served more quickly, then normally -depending on the caching solution you use- each time a user leaves a comment the cached copy of that page must be rebuilt, as the changes required to display the new comment invalidate the cache; by using Disqus -or similar- instead, a remote JavaScript call to the third-party servers fetches the comments to render on the page, so the durability of your cached pages is significantly higher. Fourth, since these services already integrate anti-spam solutions, you may also disable Akismet and other plugins on your blog, and you know that the fewer plugins you install, the lighter and faster your WordPress blog will be. Fifth, these services just work better than WordPress’ built-in comments, have interesting social features, and even help improve the SEO performance of your site in a number of ways. What more could you ask for?)

Back on topic: assuming you have by now removed any need for your users to log in locally on your site, there are several ways of protecting your site’s login page and administration area. To begin with, you could customise your WordPress copy and change the location and names of the files and folders involved in login and administration, or change altogether the way WordPress handles authentication and authorisation on your site. But this could be pretty time consuming, and it would make it a lot more difficult to upgrade your WordPress copy each time a new version is released.

A lot easier and quicker is to use plugins that add this sort of functionality to WordPress, without your having to fiddle with the core files and without compromising upgrades. But, as said, the more plugins you install, the slower your blog will be. Also, many of the plugins currently available for securing your blog either don’t work as expected with the latest release of WordPress, or do not work at all. Among the few that seem to work, I have seen some that operate by hiding, in one way or another, the wp-admin folder as well as the login page. Unfortunately, while these seem to work, it is often possible to break their functionality and bypass the protection they offer.

My advice is not to waste too much time with these plugins either. If you want a way of securing the administration of your blog that is simple, effective, and doesn’t slow down your blog or require plugins or heavy customisation, my recommendation is to simply restrict access to the reserved area(s) by IP address at the web server level. This just works and, because the restriction happens without involving WordPress at all, it is very efficient in terms of resources too, and it will hold up better against a brute force attack that floods your site with login attempts, perhaps leading to a sort of DoS. With the technique I suggest here, in fact, visitors trying to access reserved folders and pages from unauthorised IP addresses are shown a 404 page instead, which can be cached statically just like the rest of the site. With this sort of configuration, a light and fast web server such as Nginx (which I strongly recommend over heavier ones like Apache) is already able to serve thousands of requests per second with significant levels of concurrency, even on a cheap VPS, which may help in case of smaller DoS or brute force attempts.

Restricting access to an application by IP is easy with any web server; here is an example with Nginx, but it shouldn’t take long to adapt these rules to another web server (make sure you add the following lines within the server section of your virtual host configuration, and not within a location block):

server {
    ...

    set $noadmin 1;

    if ($remote_addr = "xxx.xxx.xxx.xxx")   { set $noadmin 0; }
    if ($remote_addr = "yyy.yyy.yyy.yyy")   { set $noadmin 0; }

    if ($noadmin = 1) {
        rewrite ^/wp-admin/(.*)$ /index.php?q=$1 last;
        rewrite ^/wp-([^/]*?)\.php(.*)$ /index.php?q=$1 last;
    }

    ...
}

First, we set a flag (a custom variable in nginx) to true, meaning that by default we’ll restrict access to the admin pages. Then, for each of the IP addresses we want to grant access to, we set that flag to false (nginx doesn’t support nested or combined conditions, as far as I know, hence the multiple lines here). Lastly, if the flag is still true -meaning the request comes from an unauthorised IP address- nginx rewrites the routes for the reserved folders and pages (you can customise these as you wish) in such a way that, while the URL in the browser won’t change, WordPress’ 404 page is returned instead, as if those folders or pages did not exist. We do this by simulating a WordPress search, using the name of the requested folder or page as the query. The second rewrite rule basically blocks any PHP page whose name starts with wp- (thus also wp-login.php), for improved security: none of the WordPress PHP files matching this naming convention are supposed to be requested directly from a browser.

Update: reader Dave Ross pointed out in the comments that admin-ajax.php should be excluded from the rules above, since it is required by some plugins for AJAX functionality. Thanks Dave – I usually avoid using AJAX-based plugins anyway unless I am totally sure they do not compromise the security of a site in any way, but yours was a very good point.

It’s easy to exclude admin-ajax.php by adding a negative lookahead to the regular expression in the first rule:

if ($noadmin = 1) {
  rewrite ^/wp-admin/((?!admin-ajax\.php).*)$ /index.php?q=$1 last;
  rewrite ^/wp-([^/]*?)\.php(.*)$ /index.php?q=$1 last;
}

This way the only requests that will be allowed within /wp-admin regardless of the IP address are the requests made for that particular file.
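
To verify that the rules are doing their job, a quick check I’d suggest is to request the login page and the admin folder with curl from a connection whose IP address is not in your allowed list, and confirm that you no longer get the login form or a redirect to it (yourblog.example.com is obviously a placeholder here):

# Run these from an IP address that is NOT in your allowed list.
curl -s -o /dev/null -w "%{http_code}\n" http://yourblog.example.com/wp-login.php
curl -s -o /dev/null -w "%{http_code}\n" http://yourblog.example.com/wp-admin/

Before applying the rules, the second request would normally answer with a redirect to the login page; afterwards, both should return whatever status WordPress uses for the simulated search instead.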

I like this trick because it’s simple, light on WordPress and… it just works. At this point, though, you’ve surely realised that it means you will only be able to access your blog’s administration (and thus write posts) from IP addresses you know. So what about Internet cafes, a workplace, and so on? My advice here is to overcome this limitation with a VPN or, even easier, an SSH tunnel, routing all your traffic through it. First, all your traffic will be encrypted -which is something you should always care about when connecting from Internet connections other than your own- so even an SSL certificate is no longer strictly required to access the reserved areas of your site; second, because you will be routing all your Internet traffic through your VPN/SSH gateway, the IP address your site sees is that of the gateway, not that of whichever Internet connection you are using at the moment. And since you do know the IP address of your gateway, you can configure the rules shown above to authorise it.
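
For reference, here is a minimal sketch of the SSH tunnel approach (the username, hostname and port are placeholders): open a dynamic (SOCKS) tunnel to one of your servers, then point your system or browser at it as a SOCKS proxy, so that all your traffic -and thus the IP address your blog sees- goes through that server.

# Open a SOCKS proxy on local port 8080, tunnelled through your server.
ssh -D 8080 -C -N user@yourserver.example.com

# Then set localhost:8080 as the SOCKS proxy in System Preferences > Network >
# Advanced > Proxies (or in your browser's connection settings).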

I usually set the rules above to allow my home IP address as well as those of my servers and, by default, I always route all my traffic through one of these servers with an SSH tunnel when I am not at home. To do this easily on a Mac, I recommend the combination of the utilities NetworkLocation and Meerkat, so you won’t have to manually switch SSH tunnels depending on where you are.

Happy birthday, Rails!

Though few may recall it, today marks the fifth birthday of one of the most successful web development frameworks ever: Ruby on Rails. Its first official release, v1.0, came out on December 13th, 2005.

“RoR” isn’t just one of the many frameworks out there: created by Danish developer David Heinemeier Hansson while working on Basecamp -37signals‘ popular project management web app- it has managed, in a relatively short time, to revolutionise both development patterns and habits, leading to levels of productivity previously unknown.

Loved by agile startups that need to build new web-based products quickly and on tight budgets, and ignored for some time by the many old-minded companies that wouldn’t easily adopt open source technologies anyway, RoR is today a wildly popular framework that even many of those companies are starting to adopt more and more often, for projects of any level and size.

So, why is Ruby on Rails so hot? What makes it so successful? The main reason, as said, is increased productivity, thanks to a nice implementation of the MVC pattern, a convention over configuration approach, developer-friendly handling of database migrations, and a lot more. As its home page suggests, Rails is

“optimized for programmer’s happiness and sustainable productivity”

It’s also worth remembering creator Hansson’s own words in an interview on Rails’ success back in 2005:

“I thought, if this stuff can improve my enjoyment and productivity so much over the conventional approaches I had used in the past, why shouldn’t it be able to do so for others?”

In many ways, RoR has also managed to lower the barriers for newbie programmers, although this also means that nowadays there are many self-styled “developers” in the market who, in truth, have little understanding of -or experience with- serious web development and all that it implies (infrastructure, security, database maintenance, scalability and so on). But this applies to any platform or framework, anyway.

People have always had different views on how to approach problems and how to solve them. This is especially true with technology, so it shouldn’t surprise anyone that Rails hasn’t been free from criticism during its short but successful life. There have always been fans, but also detractors ready to complain or criticise, often perhaps without even having tried the framework themselves. I can recall many flame wars on topics such as Ruby on Rails vs the rest of the world, as well as many others in which the contenders often forgot, or simply ignored, the difference between the language (Ruby) and the framework (Rails), thus blaming Rails for something that should actually have been attributed to Ruby. Even today, although a lot has changed in the meantime and even the wars between Rails fans and others (Django fans especially!) have calmed down, there are still those who just can’t bring themselves to like Rails for the reasons the rest of us have learnt to love it. Amazingly, there are even websites like IHateRubyOnRails.com, which I came across by chance earlier today (in case you are interested, ILoveRubyOnRails.com is still available, btw). Rails has also suffered some bad, often totally undeserved publicity in the past, as was the case with the Twitter incidents over scalability, but luckily for us all this has never really affected Rails’ own development, which has instead taken off, also thanks to the contributions of a huge community, and has recently hit another important milestone with a third release packed with improvements and new features.

I have worked or experimented (depending on the case) with many technologies, languages and frameworks, and while I am aware of the limitations of both Ruby and Rails, I love working with them for most projects and am particularly passionate about them. Since I started toying with Ruby and Rails (early 2006), I have had to work with other technologies too because of work-related constraints (you could still sell .NET more easily to businesses, for example); and while I am not really the kind of developer who would stick to this or that language or framework, these days the market is a lot more open to Ruby and Rails than it used to be, and I am definitely enjoying working with them most of the time now.

So, thanks a lot to David for starting it all, and again happy birthday, Rails!

Using Google’s ‘define’ search feature from your terminal

(Update 05/04/2012: Google have slightly changed the format of URLs for search, so I have updated the snippets to take this and other small changes into account)

Since I started writing content for this blog, and as English is not my mother tongue, I often find myself using various tools that help me either choose the right word for something I am trying to communicate, or check that the syntax of a sentence is correct, so that the content is readable enough.

One such tool is -unsurprisingly- Google: by searching for two different terms or phrases within double quotes (what Google calls “phrase search“), I can see which one yields the most results and is therefore more likely to be correct English. But even more useful is the define search feature: by prepending the text “define:” to your search term or query, you can instruct Google to search for and return definitions for that term or phrase from various sources, rather than a bunch of links.

I have been using define a lot lately, but at some point I got a bit tired of opening a new browser tab or window each time I had to double check the definition for a word (too much energy, you know…), so I have been toying with a little hack that now lets me use the same feature from within the terminal much more quickly, given that I always have at least one or two terminals open at any time.

There are a few command line utilities you can use to fetch web pages, with wget being one of the most popular. To fetch, for example, the definitions of the word “blog” from Google define using wget, all you need to do is type a command like the following:

wget -qO- http://www.google.co.uk/search\?q\=blog\&tbs\=dfn:1

where the option “-qO-” simply tells wget to output the content of the downloaded page directly to the screen (or STDOUT) rather than to a file. You’ll notice that wget seems to perform the request as expected, yet it shows no output. This is because -it seems- a user agent is required. So let’s try again, specifying a user agent such as Firefox:

wget -qO- -U "Mozilla/6.0 (Macintosh; I; Intel Mac OS X 11_7_9; de-LI; rv:1.9b4) Gecko/2012010317 Firefox/10.0a4" http://www.google.co.uk/search\?q\=blog\&tbs\=dfn:1

You should now see the HTML of the page as a browser would see it. Problem is, this is not really readable, is it? The next step is to strip all the HTML tags so that we preserve only the actual content we are looking for: the definitions of our search term or phrase. We can do this easily by processing the HTML with grep and instructing it to return only li HTML elements, since -you can check in the HTML- the li elements in the page correspond to the various definitions returned for your search query.

wget -qO- -U "Mozilla/6.0 (Macintosh; I; Intel Mac OS X 11_7_9; de-LI; rv:1.9b4) Gecko/2012010317 Firefox/10.0a4" http://www.google.co.uk/search\?q\=blog\&tbs\=dfn:1 \
| grep --perl-regexp --only-matching '(?<=<li style="list-style:none">)[^<]+'

In the pipe above, we tell grep to process wget’s output and use the regular expression provided as an argument to return only the parts of each matching line that match the pattern, that is -in this case- all the li elements present in the page returned by Google. If you try the command above, you will now see output similar to the following for the word “blog”:

read, write, or edit a shared on-line journal
web log: a shared on-line journal where people can post diary entries about their personal experiences and hobbies; "postings on a blog are usually in chronological order"
A blog (a contraction of the term "web log") is a type of website, usually maintained by an individual with regular entries of commentary, descriptions of events, or other material such as graphics or video. Entries are commonly displayed in reverse-chronological order. ...
website that allows users to reflect, share opinions, and discuss various topics in the form of an online journal while readers may comment on posts. ...
blogger - a person who keeps and updates a blog
(cut)

This is a lot better, but we can still improve it further by adding line numbers (with the command nl) and by making sure that HTML entities, if any, are displayed correctly in the terminal (we are not using a browser, after all). This can be done by once again using perl, and in particular its decode_entities() function:

wget -qO- -U "Mozilla/6.0 (Macintosh; I; Intel Mac OS X 11_7_9; de-LI; rv:1.9b4) Gecko/2012010317 Firefox/10.0a4" http://www.google.co.uk/search\?q\=blog\&tbs\=dfn:1 \
| grep --perl-regexp --only-matching '(?<=<li style="list-style:none">)[^<]+' \
| nl | perl -MHTML::Entities -pe 'decode_entities($_)'

You should now see a more readable output similar to the following:

1   read, write, or edit a shared on-line journal
2   web log: a shared on-line journal where people can post diary entries about their personal experiences and hobbies; "postings on a blog are usually in chronological order"
3   A blog (a contraction of the term "web log") is a type of website, usually maintained by an individual with regular entries of commentary, descriptions of events, or other material such as graphics or video. Entries are commonly displayed in reverse-chronological order. ...
4    website that allows users to reflect, share opinions, and discuss various topics in the form of an online journal while readers may comment on posts. ...
5   blogger - a person who keeps and updates a blog
(cut)

Now edit your .bash_profile file (or the equivalent for the shell you use – if different from bash, you may have to adapt the code slightly) and add this function:

define() {
    wget -qO- -U "Mozilla/6.0 (Macintosh; I; Intel Mac OS X 11_7_9; de-LI; rv:1.9b4) Gecko/2012010317 Firefox/10.0a4" http://www.google.co.uk/search\?q\=$@\&tbs\=dfn:1 \
    | grep -Po '(?<=<li style="list-style:none">)[^<]+' \
    | nl \
    | perl -MHTML::Entities -pe 'decode_entities($_)' 2>/dev/null;
}

Then, to finally use the trick from your terminal, all you have to do is enter a command like:

define blog

I love this kind of trick, as it makes the use of our dear terminal even more productive. I am sure other Google search features -as well as other web services- can be just as useful when consumed from the terminal; we’ll have a look at some more examples later on.

Faster Internet browsing with alternative DNS servers and a local cache

It is no secret to power Internet users that DNS resolution is one of the factors that most affect the performance of our Internet browsing, and sometimes even a very fast broadband connection can become a pain if a poor DNS service is used. DNS -which stands for “Domain Name System“- is a protocol that makes use of a networked database, plus a particular set of services for querying that database, with the main function of translating human-friendly, symbolic hostnames such as “www.google.com” into the numerical addresses, or IPs, of the machines hosting a website or service accessible from the Internet or, generally speaking, from a typical network. Fast DNS servers usually make for a better user experience thanks to faster Internet browsing, even with today’s fast broadband services.

With today’s media-rich websites, social networks and content mashups, in fact, each time a web page is downloaded chances are that the page contains references to images or other content hosted on several different hosts, and therefore accessible via different hostnames / domain names. While developers sometimes do this on purpose so that browsers can benefit from parallel downloading (see Steve Souders‘ famous 14 rules for faster loading web sites), since each hostname requires a trip to the DNS server, even just a few different hostnames can negatively affect the overall page loading time if a low-performing DNS server is used.

ISPs usually offer their own free DNS service together with their Internet connectivity; this is what most people normally use, and in many cases it may be just fine. However, the DNS service offered by ISPs is often poor when it comes to performance. Luckily, nowadays there are quite a few alternative, more specialised DNS services that are also freely available and usually offer either improved performance (thanks to smarter caching and a generally better setup/design) or additional features that go beyond simple DNS resolution but still make use of DNS -most importantly, improved security, with filters at the DNS level that prevent users from reaching known malicious sites, protecting them against phishing and other threats. Some of these free services only promise to deliver better performance, such as Google Public DNS and DNS Advantage by UltraDNS, while others -such as Norton DNS or Comodo Secure DNS- focus mainly on the security benefits of having active filtering at the DNS level.

Then, among the more popular ones, there is also OpenDNS, which does it all. This was likely the first specialised DNS service, and it remained basically the only one of its kind for a while, until several others spotted the significant potential of DNS-based services, as well as new revenue opportunities. OpenDNS and others offer most of their services for free, while making money with “premium” services that add extra features, as well as through NXDOMAIN hijacking: when a request is made for an unresolvable domain name, the browser won’t show the usual, expected error message; instead, as OpenDNS intercepts and processes the request, it detects that the domain name cannot be resolved (or that the actual website isn’t loading at the time) and by default redirects the user to their OpenDNS Guide page, with search results based on the mistyped or wrong domain name… plus some lovely ads, of course. This is a somewhat clever trick that makes them some decent money, so it shouldn’t surprise anyone that others have copied it - competitors as well as many ISPs who can’t miss the opportunity to make some more money the easy way. This approach by ISPs has often been criticised, however, since while OpenDNS makes it pretty clear how they make money out of their free DNS service, most others do not.

Out of the several DNS services around these days, OpenDNS still remains the most feature-rich of them all. Not only does it still offer the best DNS performance in many locations (as is the case for me in London, UK), it also offers quite a few other features that can be particularly useful when managing a network of computers or if you have small kids, thanks to security filtering and parental controls. If you are interested in these features, I recommend you create an account (it’s free!) and configure the appropriate settings for your network(s) through their excellent dashboard.

(image no longer available)

Regardless of which DNS server you use, it is still possible to improve your DNS experience, and thus your Internet browsing, a bit further by setting up a local DNS server to use as a cache. Some may argue that most operating systems and browsers already cache DNS query results locally, and while this is true, I have found that a local DNS server used as a cache still helps improve things - especially, of course, if this cache can also be shared with other clients on the same network (at home I use a local DNS cache as well as a caching proxy, Squid, to improve the overall browsing performance of my home clients).

Setting up a local DNS server is pretty easy and quick on Unix systems, provided you are familiar with the terminal. Here we’ll see how to do this on Snow Leopard, but it shouldn’t be too different on other Unix flavours once you have installed the DNS server we are going to use, BIND, with your package manager of choice. (For Windows desktops there was once a simple DNS server called TreeWalk DNS, but the project seems to have been abandoned years ago and the website is currently listed as malicious by Norton for some reason.)

BIND is already installed on Mac OS X, although it is switched off by default. For starters, to prevent users on remote systems from being able to access and control our local BIND server, we need to make sure that a secret key is used to explicitly grant privileges to a host. BIND requires a daemon called named to be running on the system, while the utility rndc takes care of the administration of this daemon with commands that will only work if the keys specified in the two configuration files /etc/named.conf and /etc/rndc.conf match.

It is possible to automate this little configuration and create a key file by executing the command


sudo rndc-confgen -a

Since BIND expects the key to be in /etc while the command above creates the key in /private/etc (at least on Snow Leopard 10.6.5), you can either


sudo mv /private/etc/rndc.key /etc/rndc.key

or


sudo vim /etc/named.conf  

and change the line include “/etc/rndc.key”; to include “/private/etc/rndc.key”;

Update: as reader David Glover reminds me in the comments, there is no need to move the file /private/etc/rndc.key to /etc/rndc.key since /etc is already a symlink to /private/etc; I can’t remember why I had done that while getting BIND to work on my system, but you should be able to safely skip that step. Thanks David.

Next, we need to tell BIND which DNS servers it should forward queries to whenever it cannot answer them directly, either because they involve domain names it has not yet seen locally or because the cached results have expired.

Open the file /etc/named.conf as sudo (unless you have it opened already from the previous step) with vim or your favourite editor, and add the following lines to the options section:

forwarders {
    208.67.222.222;
    208.67.220.220;
};

In this example, I am using OpenDNS’ servers, but you can use Norton’s public DNS (198.153.192.1, 198.153.194.1), Google Public DNS (8.8.8.8, 8.8.4.4), UltraDNS (156.154.70.1, 156.154.71.1) or whichever other DNS servers you prefer or that work best for you.
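
If you are unsure which forwarders to pick, a quick and admittedly unscientific way to compare them is to time a few lookups with dig against each candidate and look at the reported query time (the hostname and servers below are just examples):

dig @208.67.222.222 www.google.com | grep "Query time"
dig @8.8.8.8 www.google.com | grep "Query time"
dig @156.154.70.1 www.google.com | grep "Query time"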

Now, depending on the version of OS X you are using, you may or may not need to create the following - just skip this step if you already have the folder /System/Library/StartupItems/BIND.

sudo mkdir -p /System/Library/StartupItems/BIND
sudo nano /System/Library/StartupItems/BIND/BIND

Copy the following lines in the file you’ve just created (unless it was already there), and save.

#!/bin/sh
. /etc/rc.common

if [ "${DNSSERVER}" = "-YES-" ]; then
    /usr/sbin/named
fi

Then make it executable

sudo chmod +x /System/Library/StartupItems/BIND/BIND

In the same folder, create the file

sudo vim /System/Library/StartupItems/BIND/StartupParameters.plist

and copy the following lines in it:

{
    Description = "DNS Server";
    Provides = ("DNS Server");
    OrderPreference = "None";
    Messages =
    {
        start = "Starting BIND…";
        stop = "Stopping BIND…";
    };
}

By default, the DNS server is set not to start at boot. Let’s change that by opening the file

sudo vim /etc/hostconfig

and changing the content so that it contains the line

DNSSERVER=-YES-

Save, then either reboot or load BIND manually for the current session with

sudo /System/Library/StartupItems/BIND/BIND
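
Before going further, you may want to double-check that the named daemon is actually up (assuming the rndc key setup described earlier went through correctly):

sudo rndc status
ps aux | grep '[n]amed'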

At this stage BIND should be up and running, but it is not being used yet. You will need to go to System Preferences > Network > Advanced > DNS, and replace all the current DNS servers with just 127.0.0.1 so that your local DNS server is used instead. To make sure this is working as expected, type in your terminal:

scutil --dns

You should see an output similar to this:

DNS configuration
resolver #1
domain : config
nameserver[0] : 127.0.0.1
order   : 200000
....
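
A quick way to confirm that the cache is actually being used is to resolve the same name twice against 127.0.0.1 with dig: the first lookup is forwarded upstream, while the second should report a query time close to 0 msec because it is answered locally.

dig @127.0.0.1 www.example.com | grep "Query time"
dig @127.0.0.1 www.example.com | grep "Query time"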

Another thing that may be useful to know is how to flush the DNS cache should you need to do so for any reason:

sudo rndc -p 54 flush && dscacheutil -flushcache

You should now have, and be using, a local DNS cache, and your Internet browsing should feel faster. Please let me know in the comments whether this is the case for you as well, or whether you see different results.

Migrating large databases or tables from Microsoft SQL Server to MySQL

I am currently in the process of migrating some data -about 8 million rows- from a legacy environment based on Microsoft SQL Server as RDBMS, to a new production environment using MySQL instead.

8 million rows aren’t exactly what I mean by “large dataset”, sure (I have experienced the pain of much larger datasets); nevertheless, the process can take some time, especially if both systems are in use while the migration occurs and the two environments do not share the same local network - which also means you will have to transfer the data via an Internet connection or some other means.

There are various ways of transferring data between these two RDBMS systems; Microsoft SQL Server makes it pretty trivial to export data in a visual, user-friendly way with the “Export Data” task, for instance. However, I prefer a command-line approach on both systems, as I find it more flexible - especially if you need to repeat the process a number of times, with some trial and error due to data formats, encoding and whatnot requiring changes to the SQL statement more than once. Most people use SQL Server Management Studio for most tasks with SQL Server, but there’s also the bundled utility SQLCMD, which functions as a command line interpreter for Transact-SQL statements. It is particularly handy for our purpose as it can process a text file containing any SQL statement, so if we need to change that statement all we have to do is update the text file and reprocess it with SQLCMD.
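
Just to get a feel for SQLCMD before the actual export, a trivial inline query (this assumes a local instance and a trusted connection; adjust the -S and authentication options to your environment) looks like this:

SQLCMD -S localhost -E -Q "SELECT @@VERSION"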

So, this is, in a few simple steps, what I am doing right now -while I write- to migrate data from SQL Server to MySQL. As an intermediate format we are going to use plain old CSV, as it is the easiest option and can be imported pretty quickly into MySQL, as we’ll see.

Exporting

Once you know which data you need to export, create a text file -let’s call it “leads.sql”- and paste into it the SQL statement you’ll need to execute to select that data. Note: to ensure MySQL can correctly import all the data regardless of the content of string columns, it is recommended to enclose each string value in a special character used as a delimiter; we’ll use double quotes as the character delimiting the beginning and the end of each value, by means of the QUOTENAME function (passing '"' as its second argument, since by default QUOTENAME wraps values in square brackets).

For example, the query I had to run to select the data that I need to migrate is as follows:


SELECT
    SrId, LeadId, QUOTENAME(Email, '"') As Email, SrDate, ClientID, CampaignId, AffiliateId
FROM
    T_ClientLeads WITH (NOLOCK)
WHERE
    CampaignId IN ( ... )

Note: I suggest you also tell SQL Server to ignore locks with NOLOCK if you have a big source dataset and the server is in use - this will speed up the process quite a bit. I’d also recommend running a test first with a limited dataset (for example, selecting only the top N rows), just in case you need to tweak formats, encoding or the data itself.

Next, we’ll process this query file with SQLCMD, asking it to generate a CSV file for us:


SQLCMD -S localhost -h-1 -E -W -d c_data2fr -i leads.sql -o "D:\leads.csv" -s"|"  

The meaning of the options I am using here is as follows (“SQLCMD /?” shows all the available ones):


-S => the host to connect to (in my case, the same host running the SQL Server instance)

-h-1 => switches CSV column headers off (-1 means no headers are printed)

-E => makes use of a trusted connection (note that I am not specifying credentials here)

-W => removes trailing spaces/blanks from each string field value (the CSV looks cleaner and will be a lot smaller in size too)

-d => the database we want to export data from

-i => use a text file containing a SQL statement (SQLCMD can also be used to process SQL statements inline with the "-q" argument)

-o => the destination filename

-s => the column separator

If all goes well, your destination file will be filled with your test data (I am assuming you are saving yourself some time and running a test first, right?). However, you will notice that at the end of the file SQLCMD adds an annoying


X rows affected.  

(where X stands for the number of rows returned by your query). This will later make the import process either fail, give you warnings, or import some extra rows with wrong data. To fix that, change your query slightly to remove that information with NOCOUNT:

SET NOCOUNT ON
SELECT
    SrId, LeadId, QUOTENAME(Email, '"') As Email, SrDate, ClientID, CampaignId, AffiliateId
FROM
    T_ClientLeads WITH (NOLOCK)
WHERE
    campaignid IN ( ... )

Simply prepending “SET NOCOUNT ON” will take effect right away, without your having to issue a “GO” statement/command first.
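
Since in my case the CSV has to travel from the legacy environment to the new one over the Internet, it is worth compressing it first, as CSV compresses extremely well. A rough sketch of what this could look like (paths and hostnames are placeholders, and gzip/scp are assumed to be available on the exporting machine, e.g. via Cygwin on Windows):

gzip leads.csv
scp leads.csv.gz user@mysql-host.example.com:/tmp/

# On the destination host, decompress it before running LOAD DATA:
gunzip /tmp/leads.csv.gz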

Importing

Your data should now be ready to be imported into a MySQL database, provided you have already transferred it to the destination host and have created a MySQL table with a schema compatible with the columns stored in your CSV file (the order of the columns matters too). A pretty fast way of importing CSV data into MySQL is the “LOAD DATA” command. It’s very simple, and very, very fast. Just open a session with your MySQL database using your favourite client and enter this statement (obviously, change the file name as well as the table name):

LOAD DATA LOCAL INFILE 'leads.csv'
INTO TABLE T_ClientLeads FIELDS TERMINATED BY '|' ENCLOSED BY '"'
LINES TERMINATED BY '\r\n';

Here we are simply telling MySQL to import the data contained in our CSV file, using the same delimiters we have used when exporting from SQL Server with SQLCMD. If all goes well -depending on the nature of your data, you may have to apply some transformations directly in the export query- you should see an output similar to the following once the import is complete:

Query OK, 7991285 rows affected, 14 warnings (12 min 22.15 sec)
Records: 7991285  Deleted: 0  Skipped: 0  Warnings: 14

You may see some warnings, but these can often be safely ignored depending on the case (truncated values, for example, if you know why you are limiting the length of a field in the destination table). Check the imported data though, as these warnings (if any) may have a different meaning depending on the data. The time taken obviously depends on the amount of data to migrate, on both the origin and destination hosts, and on the load on the systems while you are exporting and importing. The process is reasonably speedy, however, especially if your destination table can use the MyISAM storage engine (as in my case), though this will also depend on the applications that use the database (transactions, relations, and all that goodness). As you can see, importing 8 million rows on a 4GB Linode VPS took a good 12 minutes (although the VPS has a very light load at the moment).
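
As a final sanity check, I’d also compare the row count reported by the import with a quick count on the destination table (the credentials and database name below are placeholders):

mysql -u youruser -p -e "SELECT COUNT(*) FROM T_ClientLeads" yourdatabase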

Know a better/easier/faster way of migrating data between these two RDBMS systems? I am all ears!