Sunday 19 December 2010

Words over time

Google Labs have a tool that exploits the large corpus of words they have built up through digitising all those Google Books.

A chart of the percentage occurrence of the words "art", "design" and "science" from 1810 to 2000. See it live here.
It is worth bearing in mind that none of the example words ("art", "design" and "science") continued to mean the same thing throughout that period.

Something fifhy going on 
All visualisations depend on the quality of the underlying data, and this is where the tool falls down.

I thought I would check for occurrences of a word which is unlikely to be much affected by fashion, from 1700 to 2000: I tried "fish". This produced some highly suspect results with very low occurrences before 1800 and a dramatic rise at that time:

A chart of the percentage occurrence of the word "fish" from 1700 to 2000. See it live here.

And then I realised what is going on.

Google have not corrected the long-tailed s's, which the scanning software reads as f's. If you chart the nonexistent word "fifh" you find a steady climb followed by a dramatic drop during 1780-1800, when the long s was replaced by the one we use now. Charting "fish,fifh" shows both. Together they make a more sensible picture:
A chart of the percentage occurrence of the words "fish" and "fifh" from 1700 to 2000. See it live here.
Charting "defign,design" shows a similar pattern, as does "fcience,science".
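
The same correction can be sketched in a few lines of Python. This is only an illustration of the idea, not anything Google provides: the function names are my own, the yearly figures are invented placeholders rather than real Google Books data, and it assumes the simplified rule that a long s appeared everywhere except as the last letter of a word.

```python
def long_s_misreading(word):
    """Return the spelling produced when every long s is read as an f.

    Simplifying assumption: the long s was used everywhere except as
    the final letter of a word, so a word-final "s" is left alone.
    """
    if len(word) < 2:
        return word
    return word[:-1].replace("s", "f") + word[-1]


def combined_frequency(word, freq_by_year):
    """Add a word's frequency to that of its misread form, year by year.

    freq_by_year maps year -> {spelling: frequency}.
    Spellings missing from a year count as zero.
    """
    # A set avoids double counting when the word contains no long s.
    spellings = {word, long_s_misreading(word)}
    return {
        year: sum(counts.get(spelling, 0) for spelling in spellings)
        for year, counts in freq_by_year.items()
    }


# Invented placeholder numbers, NOT real Google Books data.
sample = {
    1750: {"fish": 1, "fifh": 9},
    1850: {"fish": 10, "fifh": 0},
}

print(long_s_misreading("design"))         # defign
print(combined_frequency("fish", sample))  # {1750: 10, 1850: 10}
```

Summing the two spellings is what charting "fish,fifh" lets you do by eye: the apparent rise around 1800 disappears once the misread form is counted as the same word.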

This is idleness on Google's part and undermines the usefulness of the tool. I am sure it would be perfectly possible to make their Optical Character Recognition software tell the difference between a long-tailed s and an f, since they are not the same glyph:
Different glyphs for f and long s. From Joseph Priestley, 1764, A Description of a Chart of Biography on Google Books.
The crossbar of a long-tailed s extends only to the left, as in "science" in the first line here, whereas the crossbar of an f extends to both sides, as in "therefore" in line one/two.

A tool to use with care, indeed suspicion.

Friday 17 December 2010

Ben Fry's watching the evolution of the Origin of Species

Ben Fry, co-inventor of Processing, has created an intriguing visualisation of the chapters of Darwin’s Origin of Species, showing how they alter between each of the six editions that Darwin produced between 1859 and 1876.
Ben Fry. 2010. On the Origin of Species: The Preservation of Favoured Traces.
Used with permission.
These are not representations of time as such, but representations of change. Ben writes:
The idea that we can actually see change over time in a person’s thinking is fascinating. Darwin scholars are of course familiar with this story, but here we can view it directly, both on a macro-level as it animates, or word-by-word as we examine pieces of the text more closely.
This is where the hidden depths of the project lie. At first sight ‘just’ a visualisation, this is actually an interface to the full text of all the editions, based on van Wyhe et al.’s Complete Work of Charles Darwin Online.
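
For a feel of what word-by-word comparison between editions involves, here is a minimal sketch using Python's standard difflib module. It is not Ben Fry's method (his project is written in Processing and I have not seen how he computes the changes); the function name is my own, and the two fragments are the well-known closing-sentence change between the first and second editions, which should be checked against Darwin Online before being relied on.

```python
from difflib import SequenceMatcher


def word_changes(old_text, new_text):
    """List the word-level differences between two versions of a passage.

    Returns (kind, old_words, new_words) tuples, where kind is one of
    "insert", "delete" or "replace". Unchanged runs are skipped.
    """
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = SequenceMatcher(None, old_words, new_words, autojunk=False)
    changes = []
    for kind, i1, i2, j1, j2 in matcher.get_opcodes():
        if kind != "equal":
            changes.append(
                (kind, " ".join(old_words[i1:i2]), " ".join(new_words[j1:j2]))
            )
    return changes


# Fragments for illustration only; verify against the full texts.
first_edition = "originally breathed into a few forms or into one"
second_edition = "originally breathed by the Creator into a few forms or into one"

print(word_changes(first_edition, second_edition))
# [('insert', '', 'by the Creator')]
```

Run over whole chapters, edition against edition, a comparison like this yields the kind of added and removed passages that the visualisation colours.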

Ben has written a book for O'Reilly, Visualizing Data: Exploring and Explaining Data with the Processing Environment, based on his 2004 PhD dissertation at the MIT Media Lab, Computational Information Design [PDF].