Sunday 19 December 2010

Words over time

Google Labs have a tool that exploits the large corpus of words they have built up through digitising all those Google Books.

A chart of the percentage occurrence of the words "art", "design" and "science" from 1810 to 2000. See it live here.
It is worth bearing in mind that none of the example words ("art", "design" and "science") continued to mean the same thing throughout that period.

Something fifhy going on 
All visualisations depend on the quality of the underlying data, and this is where the tool falls down.

I thought I would check for occurrences of a word which is unlikely to be much affected by fashion, from 1700 to 2000: I tried "fish". This produced some highly suspect results with very low occurrences before 1800 and a dramatic rise at that time:

A chart of the percentage occurrence of the word "fish" from 1700 to 2000. See it live here.

And then I realised what is going on.

Google have not corrected the long-tailed s's, which the scanning software reads as f's. If you chart the nonexistent word "fifh" you find a steady climb followed by a dramatic drop during 1780-1800, when the long s was replaced by the one we use now. Charting "fish,fifh" shows both. Together they make a more sensible picture:
A chart of the percentage occurrence of the words "fish" and "fifh" from 1700 to 2000. See it live here.
Charting "defign,design" shows a similar pattern, as does "fcience,science".
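
For anyone who wants to repeat the check on other words, here is a minimal Python sketch of the same idea: work out the spelling the scanner would produce if it read every long s as an f, then add the two yearly series together. It rests on a simplifying assumption (every lowercase s except a word-final one was printed as a long s; real printers' rules had more exceptions), the function names are my own, and the numbers in the usage example are invented purely for illustration, not taken from Google's data.

```python
from typing import Dict


def long_s_misreading(word: str) -> str:
    """Spelling produced if every long s in `word` is misread as f.

    Simplifying assumption: every lowercase 's' except a word-final one
    was printed as a long s. Historical usage had more exceptions.
    """
    if not word:
        return word
    body, last = word[:-1], word[-1]
    return body.replace("s", "f") + last


def combined_series(
    counts: Dict[str, Dict[int, float]], word: str
) -> Dict[int, float]:
    """Add the yearly frequencies of `word` and its misread variant."""
    variant = long_s_misreading(word)
    real = counts.get(word, {})
    misread = counts.get(variant, {})
    years = sorted(set(real) | set(misread))
    return {y: real.get(y, 0.0) + misread.get(y, 0.0) for y in years}


if __name__ == "__main__":
    # Invented figures, for illustration only.
    sample = {
        "fish": {1750: 0.25, 1850: 2.0},
        "fifh": {1750: 1.5, 1850: 0.0},
    }
    print(long_s_misreading("design"))      # defign
    print(combined_series(sample, "fish"))  # {1750: 1.75, 1850: 2.0}
```

With these made-up figures the combined line stays roughly level across the changeover, which is the "more sensible picture" described above.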

This is idleness on Google's part and undermines the usefulness of the tool. I am sure it would be perfectly possible to make their Optical Character Recognition software tell the difference between a long-tailed s and an f, since they are not the same glyph:
Different glyphs for f and long s. From Joseph Priestley, 1764, A Description of a Chart of Biography on Google Books.
The crossbar of a long-tailed s extends only to the left, as in "science" in the first line here, whereas the crossbar of an f extends to both sides, as in "therefore" in line one/two.

A tool to use with care, indeed suspicion.

1 comment:

  1. This is really fascinating. Tim Shortis and I developed (and published) a teaching resource to help students understand when spelling/orthography standardised in English. We used 40 words from a preface Caxton wrote (i.e. long-rooted words like your fish), and then had students use the online OED to work out when they appear to have settled into their modern spelling/orthography. They then had to plot their findings in a shared Excel workbook, which produced a similar visual representation of change over time. I shall definitely re-run the activity with this little Google widget to see how it compares...
