What colour is your document – etymology visualisation

For a bit of fun, I’ve been playing around with the Shakspere thread on the Etymology Discovery Message Board. It’s pretty cool. You paste in a chunk of text, and a text parser colours each word based on the etymology of the word. The parser recognises origins in Old Norse, German, Old English, Middle English, various flavours of French, Latin, and Greek.

When I started out, I thought we might be able to see some characteristic that sets technical documentation apart from other types of writing. For example, since technical documentation is rather forthright and not known for its romantic atmosphere, perhaps most of the words would be of Germanic or Old English origin. I guess medical texts would have a higher content of Latin than other writing does.

In fact, my very small samples don’t lead to any definite conclusions. That’s not surprising. Indeed, one of my non-technical blog posts (“Jazz in Harlem”) has a higher Old English ratio than the technical documents.

Without further ado, here are the samples I chose, and the resulting etymological visualisations.

Colour coding on the etymology visualisation pages

The colour-coding on Shakspere is as follows:

(You’ve probably noticed the typo. I was tempted to fix it just for the screenshot – ha ha, been there, done that – but decided not to. If you haven’t noticed it, don’t worry. Spotting such things is one of the hazards of being a technical writer!)

Technical document: “About Confluence”

Source: A short document introducing Confluence wiki: About Confluence.

Etymology visualisation from Shakspere for “About Confluence”:

Etymology graph for “About Confluence”:

Technical document: “Installing JIRA on Windows”

Source: An installation guide for JIRA web app: Installing JIRA on Windows

Part of the etymology visualisation from Shakspere for “Installing JIRA on Windows”:

Etymology graph for “Installing JIRA on Windows”:

Technical blog post: “Comparing SharePoint and Confluence”

Source: A technical writer’s blog post (mine): Comparing SharePoint and Confluence.

Part of the etymology visualisation from Shakspere for “Comparing SharePoint and Confluence”:

Etymology graph for “Comparing SharePoint and Confluence”:

Quirky travel post: “Jazz in Harlem”

I picked this one because, although it’s mine, it’s very different from technical writing. I find it interesting how much Old English there is in this post.

Source: A blog post by the Travelling Worm: Jazz in Harlem, New York.

Part of the etymology visualisation from Shakspere for “Jazz in Harlem”:

Etymology graph for “Jazz in Harlem”:

Technical blog post: “Returning to Findability”

Source: A technical writer’s blog post from Tom Johnson: Returning to Findability.

Part of the etymology visualisation from Shakspere for “Returning to Findability”:

Etymology graph for “Returning to Findability”:

Technical blog post: “Improve tech comm by knowing a foreign language”

I found this one particularly interesting. The author of this sample, Kai Weber, is a technical writer whose first language is German. He does almost all his technical writing in English. I wondered if there would be a higher percentage of German words in the sample. But instead, there’s a significantly higher proportion of words from Old French than in other samples.

Source: A technical writer’s blog post from Kai Weber: Improve tech comm by knowing a foreign language.

Part of the etymology visualisation from Shakspere for “Improve tech comm by knowing a foreign language”:

Etymology graph for “Improve tech comm by knowing a foreign language”:

Company blog post: “Summit Aftermath – My 5 Highlights”

Given the above results, I thought it would be interesting to see another English post by a German author.

Source: A company blog post from Sven Peters: Summit Aftermath – My 5 Highlights.

Part of the etymology visualisation from Shakspere for “Summit Aftermath – My 5 Highlights”:

Etymology graph for “Summit Aftermath – My 5 Highlights”:

Technical blog post: “Facebook for Social Support? I like.”

Source: A technical writer’s blog post from Anne Gentle: Facebook for Social Support? I like.

Etymology visualisation from Shakspere for “Facebook for Social Support? I like.”:

Etymology graph for “Facebook for Social Support? I like.”:

How did I make the graphs?

The graphs in this post are derived from the data on the Shakspere pages. Here’s how I did it:

  • Go to the Shakspere page that shows the colour-coded etymology. For example: a visualisation for “About Confluence”.
  • View the HTML source of the page.
  • Go to the bottom of the Shakspere HTML source to find the codes for each language that Shakspere recognises. This is what I found:
    The colour-coding is as follows:<br />
    <span class="onr">Words of Old Norse origin</span><br/>
    <span class="grm">Words of German origin</span><br/>
    <span class="oeg">Words of Old English origin</span><br/>
    <span class="mde">Words of Middle English origin</span><br/>
    <span class="ofr">Words of Old French origin</span><br/>
    <span class="frc">Words of French origin</span><br/>
    <span class="afr">Words of Anlgo-French origin</span><br/>
    <span class="mfr">Words of Middle French origin</span><br/>
    <span class="lat">Words of Latin origin</span><br/>
    <span class="mdl">Words of Middle Latin origin</span><br/>
    <span class="modl">Words of Modern Latin origin</span><br/>
    <span class="grk">Words of Greek origin</span><br/>
  • Now move up in the HTML slightly, and find the line that contains this element:
    <div class="messagebody">
  • Copy the entire line from immediately after the above element, and paste it into Notepad++ or another text editor with a good “Find” function.
  • You’ll see that each word has a semantic tag indicating the language derivation that Shakspere found. For example, in this extract the word “Confluence” is marked as Latin and the word “is” is marked as Old English:
    <span class="lat">Confluence</span> <span class="oeg">is</span>
  • Use the “Find” function in Notepad++ to find the number of occurrences of each language.
  • Put your findings into an Excel spreadsheet, with a column for the language and a column for the number of words found.
  • Use Excel’s graph functions to create the graph.
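If you’d rather not count by hand, the counting step can be scripted. Here’s a minimal sketch in Python, not what I actually used for this post. It assumes you’ve pasted in only the line copied from after the “messagebody” element (the colour key at the bottom of the page uses the same classes, so feeding in the whole page would add one to each count). The names `LANGUAGES`, `SpanCounter` and `count_origins` are my own inventions for illustration, and the sample text is the short extract shown above.

```python
from collections import Counter
from html.parser import HTMLParser

# Class codes copied from the colour key in the Shakspere page source.
# The readable labels are my own (hypothetical) choice.
LANGUAGES = {
    "onr": "Old Norse",
    "grm": "German",
    "oeg": "Old English",
    "mde": "Middle English",
    "ofr": "Old French",
    "frc": "French",
    "afr": "Anglo-French",
    "mfr": "Middle French",
    "lat": "Latin",
    "mdl": "Middle Latin",
    "modl": "Modern Latin",
    "grk": "Greek",
}


class SpanCounter(HTMLParser):
    """Counts <span> start tags whose class is a known language code."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        code = dict(attrs).get("class")
        if code in LANGUAGES:
            self.counts[LANGUAGES[code]] += 1


def count_origins(html):
    """Return a Counter mapping language name to number of tagged words."""
    parser = SpanCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Assumption: this stands in for the line copied from the message body.
sample = '<span class="lat">Confluence</span> <span class="oeg">is</span>'
counts = count_origins(sample)

# Print one "language: count" row per origin, ready to paste into a spreadsheet.
for language, n in counts.most_common():
    print(f"{language}: {n}")
```

The printed rows match the two spreadsheet columns described above (language and number of words), so the graphing step stays the same.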

Ready for more?

It was this post that led me to these etymological experiments: Visualizing English Word Origins, on the Ideas Illustrated blog.

Perhaps you have a medical text, or a romantic poem, to analyse? Drop it into Shakspere and let me know what you find. 🙂

About Sarah Maddox

Technical writer, author and blogger in Sydney

Posted on 8 July 2012, in language, technical writing. 6 Comments.

  1. Oh, what fun! Thanks, Sarah, for “something completely different” and so inspiring!

    I’ll just run with it, leave aside the science and say that one of the things I appreciate about English is exactly its ability to assimilate vocabulary from different sources, to the point of redundancy. Trying to tease apart the differences between “freedom” and “liberty” could entertain this American Studies scholar for a few hours…

    • Hallo Kai

      Thanks for a great comment! Yes, sometimes it’s worth going off the beaten track, just to see what ideas you kick up. 🙂

      Cheers, Sarah

  2. This was fun, but I don’t think that it’s completely accurate. How is “psychologist” not derived from Greek? Or “benefit” not from French and Latin? For that matter, isn’t “wiki” from Hawaiian?

    • Hallo elise

      You’re right, the word derivation isn’t accurate. The creator of the tool explains that it’s experimental. The parsing is naive, and the etymology is based on a dictionary of 10000 words.

      Still, it’s a cool idea. Perhaps someone can suggest a better parser, which would improve the identification of the root of each word. For the etymological matching, a dictionary of 10000 words seems adequate.

      Cheers, Sarah

  3. Interesting – perhaps a feature idea for UX Write? 😉
