has_many :codes

Using Google’s ‘define’ search feature from your terminal


(Update 05/04/2012: Google have slightly changed the format of URLs for search, so I have updated the snippets to take this and other small changes into account)

Since I started writing content for this blog, and as English is not my mother tongue, I quite often find myself using various tools that help me either choose the right word for what I am trying to say, or check that the syntax of a sentence is correct, so that the content is readable enough.

One such tool is, unsurprisingly, Google: by searching for two different terms or phrases within double quotes (what Google calls “phrase search”), I can see which one yields the most results and is therefore more likely to be correct English. But even more useful is the define search feature: by prepending the text “define:” to your search term or query, you can instruct Google to return definitions for that term or phrase from various sources directly, rather than a bunch of links.

I have been using define a lot lately, but at some point I got a bit tired of opening a new browser tab or window each time I had to double check the definition for a word (too much energy, you know…), so I have been toying with a little hack that now lets me use the same feature from within the terminal much more quickly, given that I always have at least one or two terminals open at any time.

There are a few command line utilities you can use to fetch web pages, with wget being one of the most popular. To fetch, for example, the definitions for the word “blog” from Google define using wget, all you need to do is type a command like the following:

wget -qO- http://www.google.co.uk/search\?q\=blog\&tbs\=dfn:1

where the option “-qO-” combines -q, which suppresses wget’s own progress messages, with -O-, which writes the downloaded page to the screen (STDOUT) rather than to a file. You’ll notice that wget seems to perform the request as expected, but it shows no output. This is, it seems, because a user agent is required. So let’s try again, specifying a user agent such as “Firefox”:

wget -qO- -U "Mozilla/6.0 (Macintosh; I; Intel Mac OS X 11_7_9; de-LI; rv:1.9b4) Gecko/2012010317 Firefox/10.0a4" http://www.google.co.uk/search\?q\=blog\&tbs\=dfn:1

You should now see the HTML of the page as a browser would see it. Problem is, this is not really readable, is it? The next step is to strip all the HTML tags so that we keep only the content we are actually looking for: the definitions for our search term or phrase. We can do this easily by processing the HTML with grep and instructing it to return only the text of the li elements since, as you can check in the HTML, the li elements in the page correspond to the various definitions returned for your search query.

wget -qO- -U "Mozilla/6.0 (Macintosh; I; Intel Mac OS X 11_7_9; de-LI; rv:1.9b4) Gecko/2012010317 Firefox/10.0a4" http://www.google.co.uk/search\?q\=blog\&tbs\=dfn:1 \
| grep --perl-regexp --only-matching '(?<=<li style="list-style:none">)[^<]+'

In the pipe above, we tell grep to process wget’s output and to print only the part of each line that matches the regular expression given as argument. In this case, that is the text that immediately follows each opening li tag in the page returned by Google, up to the next tag. If you try the command above you will now see output similar to the following for the word “blog”:

read, write, or edit a shared on-line journal
web log: a shared on-line journal where people can post diary entries about their personal experiences and hobbies; "postings on a blog are usually in chronological order"
A blog (a contraction of the term "web log") is a type of website, usually maintained by an individual with regular entries of commentary, descriptions of events, or other material such as graphics or video. Entries are commonly displayed in reverse-chronological order. ...
website that allows users to reflect, share opinions, and discuss various topics in the form of an online journal while readers may comment on posts. ...
blogger - a person who keeps and updates a blog
(cut)
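To see what the lookbehind in that regular expression does without making a request, you can run the same grep against a small piece of markup. This is only a sketch: the sample_html string below is made up for illustration and is not Google’s actual response, and extract_definitions is a hypothetical helper name. It also assumes a GNU grep built with --perl-regexp support.

```shell
# Made-up markup standing in for the page returned by Google (illustration only).
sample_html='<ul><li style="list-style:none">read, write, or edit a shared on-line journal<br></li>
<li style="list-style:none">blogger - a person who keeps and updates a blog</li></ul>'

# extract_definitions (hypothetical helper): reads HTML on stdin and prints
# only the text that directly follows each opening li tag.
extract_definitions() {
  # (?<=...) is a lookbehind: the tag must come right before the match,
  # but it is not part of what gets printed.
  # [^<]+ then matches everything up to the next tag.
  grep --perl-regexp --only-matching '(?<=<li style="list-style:none">)[^<]+'
}

printf '%s\n' "$sample_html" | extract_definitions
```

Because the lookbehind is not part of the match, --only-matching prints just the definition text, one line per li element.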

This is a lot better, but we can still improve it further by adding line numbers (with the command nl) and making sure that HTML entities, if any, are displayed correctly in the terminal (we are not using a browser, after all). This can be done by using perl once again, and in particular the decode_entities() function from its HTML::Entities module:

wget -qO- -U "Mozilla/6.0 (Macintosh; I; Intel Mac OS X 11_7_9; de-LI; rv:1.9b4) Gecko/2012010317 Firefox/10.0a4" http://www.google.co.uk/search\?q\=blog\&tbs\=dfn:1 \
| grep --perl-regexp --only-matching '(?<=<li style="list-style:none">)[^<]+' \
| nl | perl -MHTML::Entities -pe 'decode_entities($_)'

You should now see a more readable output similar to the following:

1 read, write, or edit a shared on-line journal
2 web log: a shared on-line journal where people can post diary entries about their personal experiences and hobbies; "postings on a blog are usually in chronological order"
3 A blog (a contraction of the term "web log") is a type of website, usually maintained by an individual with regular entries of commentary, descriptions of events, or other material such as graphics or video. Entries are commonly displayed in reverse-chronological order. ...
4 website that allows users to reflect, share opinions, and discuss various topics in the form of an online journal while readers may comment on posts. ...
5 blogger - a person who keeps and updates a blog
(cut)
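HTML::Entities is not part of core Perl (it ships with the HTML-Parser distribution), so it may be missing on some machines. As a rough fallback, and only as a sketch, a few of the most common entities can be decoded with sed instead. The helper name decode_common_entities is hypothetical, and it covers far fewer entities than decode_entities() does.

```shell
# decode_common_entities (hypothetical helper): decodes a handful of common
# HTML entities read from stdin. It is not a full replacement for
# HTML::Entities, which knows the complete entity table.
decode_common_entities() {
  # &amp; is decoded last, so that "&amp;lt;" becomes "&lt;" and is not
  # decoded a second time into "<".
  sed -e 's/&quot;/"/g' \
      -e "s/&#39;/'/g" \
      -e 's/&lt;/</g' \
      -e 's/&gt;/>/g' \
      -e 's/&amp;/\&/g'
}

# Number two sample lines with nl, then decode the entities they contain.
printf '%s\n' 'postings on a &quot;blog&quot;' 'Q&amp;A site' | nl | decode_common_entities
```

In the pipe shown earlier, this function would take the place of the perl command at the end.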

Now edit your .bash_profile file (or the equivalent for the shell you use; if it is not bash, you may have to adapt the code slightly) and add this function:

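define() {
  # Join all the arguments into a single query and replace spaces with "+",
  # so that phrases of more than one word work too.
  local query="$*"
  query="${query// /+}"
  wget -qO- -U "Mozilla/6.0 (Macintosh; I; Intel Mac OS X 11_7_9; de-LI; rv:1.9b4) Gecko/2012010317 Firefox/10.0a4" "http://www.google.co.uk/search?q=${query}&tbs=dfn:1" \
    | grep -Po '(?<=<li style="list-style:none">)[^<]+' \
    | nl \
    | perl -MHTML::Entities -pe 'decode_entities($_)' 2>/dev/null
}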

Then, to finally use the trick from your terminal, all you have to do is enter a command like:

define blog

I love these kinds of tricks, as they make our dear terminal even more productive. I am sure other Google search features, as well as other web services, can be just as useful when consumed from the terminal; we’ll have a look at some more examples later on.

© Vito Botta