I’ve grabbed some Google Analytics statistics about the languages used by visitors to the Atlassian documentation wiki. The information is based on the language setting in people’s browsers. It’s a pretty cool way of judging whether we need to translate our documentation!
The statistics cover a period of 3 months, from 7 September to 7 December 2012.
Approximately 30% of our readers have their browser set to a language other than English. The most popular non-English language is German (approximately 7%), followed by French (approximately 2.6%). Japanese is hard to quantify, because we have separate sites for Japanese content.
The pretty picture
This graph shows the results for the top 10 locales:
The grey sector represents a number of smaller segments, each one below 1%. In Google Analytics, I can see them by requesting more than 10 lines of data.
Here are the figures that back the above graph:
|Locale||Number of visits||Percentage of total|
More Google Analytics?
Google Analytics is a useful tool. If you’re interested in a couple more posts about it, try the Google Analytics tag on this blog. I hope the posts are interesting.
For a bit of fun, I’ve been playing around with the Shakspere thread on the Etymology Discovery Message Board. It’s pretty cool. You paste in a chunk of text, and a text parser colours each word based on the etymology of the word. The parser recognises origins in Old Norse, German, Old English, Middle English, various flavours of French, Latin, and Greek.
When I started out, I thought we might be able to see some characteristic of technical documentation as opposed to other types of writing. For example, since technical documentation is rather forthright and not known for its romantic atmosphere, perhaps most of the words would be of Germanic or Old English origin. I guess medical texts would have a higher content of Latin than other writing does.
In fact, my very small samples don’t lead to any definite conclusions. That’s not surprising. Indeed, one of my non-technical blog posts (“Jazz in Harlem”) has a higher Old English ratio than the technical documents.
Without further ado, here are the samples I chose, and the resulting etymological visualisations.
Colour coding on the etymology visualisation pages
The colour-coding on Shakspere is as follows:
(You’ve probably noticed the typo. I was tempted to fix it just for the screenshot – ha ha, been there, done that – but decided not to. If you haven’t noticed it, don’t worry. Spotting such things is one of the hazards of being a technical writer!)
Technical document: “About Confluence”
Source: A short document introducing Confluence wiki: About Confluence.
Etymology visualisation from Shakspere for “About Confluence”:
Etymology graph for “About Confluence”:
Technical document: “Installing JIRA on Windows”
Source: An installation guide for JIRA web app: Installing JIRA on Windows
Part of the etymology visualisation from Shakspere for “Installing JIRA on Windows”:
Etymology graph for “Installing JIRA on Windows”:
Technical blog post: “Comparing SharePoint and Confluence”
Source: A technical writer’s blog post (mine): Comparing SharePoint and Confluence.
Part of the etymology visualisation from Shakspere for “Comparing SharePoint and Confluence”:
Etymology graph for “Comparing SharePoint and Confluence”:
Quirky travel post: “Jazz in Harlem”
I picked this one because, although it’s mine, it’s very different from technical writing. I find it interesting how much Old English there is in this post.
Source: A blog post by the Travelling Worm: Jazz in Harlem, New York.
Part of the etymology visualisation from Shakspere for “Jazz in Harlem”:
Etymology graph for “Jazz in Harlem”:
Technical blog post: “Returning to Findability”
Source: A technical writer’s blog post from Tom Johnson: Returning to Findability.
Part of the etymology visualisation from Shakspere for “Returning to Findability”:
Etymology graph for “Returning to Findability”:
Technical blog post: “Improve tech comm by knowing a foreign language”
I found this one particularly interesting. The author of this sample, Kai Weber, is a technical writer whose first language is German. He does almost all his technical writing in English. I wondered if there would be a higher percentage of German words in the sample. But instead, there’s a significantly higher proportion of words from Old French than in other samples.
Source: A technical writer’s blog post from Kai Weber: Improve tech comm by knowing a foreign language.
Part of the etymology visualisation from Shakspere for “Improve tech comm by knowing a foreign language”:
Etymology graph for “Improve tech comm by knowing a foreign language”:
Company blog post: “Summit Aftermath – My 5 Highlights”
Given the above results, I thought it would be interesting to see another English post by a German author.
Source: A company blog post from Sven Peters: Summit Aftermath – My 5 Highlights.
Part of the etymology visualisation from Shakspere for “Summit Aftermath – My 5 Highlights”:
Etymology graph for “Summit Aftermath – My 5 Highlights”:
Technical blog post: “Facebook for Social Support? I like.”
Source: A technical writer’s blog post from Anne Gentle: Facebook for Social Support? I like.
Etymology visualisation from Shakspere for “Facebook for Social Support? I like.”:
Etymology graph for “Facebook for Social Support? I like.”:
How did I make the graphs?
The graphs in this post are derived from the data on the Shakspere pages. Here’s how I did it:
- Go to the Shakspere page that shows the colour-coded etymology. For example: a visualisation for “About Confluence”.
- View the HTML source of the page.
- Go to the bottom of the Shakspere HTML source to find the codes for each language that Shakspere recognises. This is what I found:
The colour-coding is as follows:<br />
<span class="onr">Words of Old Norse origin</span><br/>
<span class="grm">Words of German origin</span><br/>
<span class="oeg">Words of Old English origin</span><br/>
<span class="mde">Words of Middle English origin</span><br/>
<span class="ofr">Words of Old French origin</span><br/>
<span class="frc">Words of French origin</span><br/>
<span class="afr">Words of Anlgo-French origin</span><br/>
<span class="mfr">Words of Middle French origin</span><br/>
<span class="lat">Words of Latin origin</span><br/>
<span class="mdl">Words of Middle Latin origin</span><br/>
<span class="modl">Words of Modern Latin origin</span><br/>
<span class="grk">Words of Greek origin</span><br/>
- Now move up in the HTML slightly, and find the line that contains this element:
- Copy the entire line from immediately after the above element, and paste it into Notepad++ or another text editor with a good “Find” function.
- You’ll see that each word has a semantic tag indicating the language derivation that Shakspere found. For example, in this extract the word “Confluence” is marked as Latin and the word “is” is marked as Old English:
- Use the “Find” function in Notepad++ to find the number of occurrences of each language.
- Put your findings into an Excel spreadsheet, with a column for the language and a column for the number of words found.
- Use Excel’s graph functions to create the graph.
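If you’d rather not count by hand, the “Find” steps above could be sketched in a few lines of Python. This is only a rough sketch, not how I made the graphs. It assumes that each colour-coded word is wrapped in a span whose class is one of the codes from the legend, and the sample line, the class name OriginCounter and the function name count_origins are made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Class codes taken from the colour-coding legend in the Shakspere HTML source.
ORIGIN_CODES = {
    "onr": "Old Norse",
    "grm": "German",
    "oeg": "Old English",
    "mde": "Middle English",
    "ofr": "Old French",
    "frc": "French",
    "afr": "Anglo-French",
    "mfr": "Middle French",
    "lat": "Latin",
    "mdl": "Middle Latin",
    "modl": "Modern Latin",
    "grk": "Greek",
}


class OriginCounter(HTMLParser):
    """Counts <span> start tags whose class is one of the origin codes."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        css_class = dict(attrs).get("class")
        if css_class in ORIGIN_CODES:
            self.counts[ORIGIN_CODES[css_class]] += 1


def count_origins(html_line):
    """Return a Counter mapping each language origin to its word count."""
    parser = OriginCounter()
    parser.feed(html_line)
    return parser.counts


# Hypothetical extract (an assumption, not copied from a real Shakspere page).
sample = (
    '<span class="lat">Confluence</span> '
    '<span class="oeg">is</span> '
    '<span class="oeg">a</span> wiki'
)

# Print one tab-separated row per origin, ready to paste into a spreadsheet.
for origin, total in count_origins(sample).most_common():
    print(f"{origin}\t{total}")
```

Feed it only the line you copied from the page, not the whole HTML source. Otherwise the legend at the bottom of the page adds one extra count to every language.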
Ready for more?
It was this post that led me to these etymological experiments: Visualizing English Word Origins, on the Ideas Illustrated blog.
Perhaps you have a medical text, or a romantic poem, to analyse? Drop it into Shakspere and let me know what you find.
Is it OK to say “PDFs” instead of “PDF files” or “PDF documents”? A colleague, Paul, and I were discussing this crucial question a few days ago.
How about “emails” instead of “email messages”? Or, horror of horrors, “XMLs” instead of “XML files”? Believe it or not, I’ve seen that one crop up in our documentation. I’ve yet to see or hear anyone talk about “HTMLs” but even that’s not beyond the bounds of possibility.
Actually, I’ve grown accustomed to hearing “emails” and can even be heard to use that word myself on occasion. But I still try to steer clear of it in technical documentation. The use of “PDFs” irks me, and I doubt if I’ll ever like “XMLs”. But never say never.
Paul and I decided that our primary aim is to make sure the meaning is clear and to avoid annoying our readers. After all, as technical writers, we want to give people a smooth ride through the text.
What do you think?
Am I being a stickler for detail?
Talking of sticks, this gorgeous visitor appeared on our postbox a few days ago. It’s a stick insect. I put the peg there to give an idea of scale. The stick insect has six legs, two of which are close together on the right hand side of the picture. Its head is on the left.
This weekend I attended the ASTC-NSW 2010 conference in Sydney. These are the notes that I took during a session by Sarah Forget. All the credit for the content and ideas goes to Sarah. Any mistakes are mine.
Sarah Forget presented a session on preparing your documentation for translation, titled “How to write for translation: New challenges for the writer”.
Introducing the topic
In the course of her work, Sarah has seen many mistranslations. Most of those were caused by ambiguous English structures.
The non-functional literacy rate in Australia is 46%.
You need to know who the translators are, as an audience. This helps you to know how to write for them.
More than just translation, localisation includes a customisation of the message in the target language and culture.
- Regional settings
As well as being well written, your content must be well structured, to optimise the translation process.
Getting to know how the translators work
Translators use CAT tools (computer-assisted translation tools), which include the following functionality:
- Translation memory – Stores strings of text in the source and the target language. This helps the translator to keep consistency across their translations.
- Terminology manager – Stores concepts (“terminological values”) in two or more languages, including synonyms, antonyms, definitions, and so on. Example of such a tool: MultiTerm.
Sarah showed us a video of a translator’s desktop with the software in action while the translator was working. It showed how the translation memory prompts the translator with existing translations for each phrase, or with translations that are similar but not exact.
The formatting and layout of the source document affects the processing done by the translation memory tool. For example, tabs can be mistaken for an end of paragraph.
If you use consistent terminology and Plain English, the terminology manager (MultiTerm) can find the match easily. By the way, MultiTerm also educates you about what Plain English is.
Once the translator has translated a term, they can then choose to add it to the terminology manager.
The translator can save the document as a bilingual document, containing both languages so that they can continue working on it or make corrections after review. When the work is complete, they save the document in the target language.
Tips from Sarah
These are just some of the tips Sarah gave us:
1) Assume that your text will increase in size by approximately 30% after translation. To manage this expansion, add extra space in the original document, such as after paragraphs or in tables.
2) Avoid manual formatting.
- Use the templates, styles and automated capabilities provided by your authoring software.
- You don’t even need to include the table of contents for translation, if it’s automatically generated.
- Don’t add spaces, line breaks and so on. Manual line breaks, tabs and spaces will limit the efficiency of your CAT tool.
- Avoid manual hyphenation, because the conventions are different in different languages.
- If you take care with this sort of thing, then there will be no work for you to do when you get the document back from translation.
3) Use consistent and clear terminology throughout a single document, or even better throughout the documentation of the entire company.
- Avoid jargon, colloquialism and regional language.
- A good way to manage this is to create a glossary, defining the terms to use and those not to use. You can then also send this glossary to the translator.
4) Write sentences that can be understood without context. For example, this sentence is almost impossible to translate: “How is it used.” The pronoun “it” needs to be replaced by a masculine, feminine or neuter pronoun, depending on the target language and on the noun that “it” refers to.
5) Ensure there is no ambiguity in the sentences. This sentence is an example of ambiguity: “Remove the part using the filter”. Do you use the filter to remove the part, or do you remove the part that uses the filter?
6) Include the articles in a sentence. Don’t leave them out, because they are often necessary to clarify meaning.
7) Don’t stack your nouns. Stacked nouns are particularly hard to translate. A comment from the floor caused some laughter there: “It’s called a ‘noun sandwich’!”
8 ) Keep the subject and verb close to each other. (Being technical writers, you’re probably wondering why there’s an extra space between the 8 and the bracket at the beginning of this line. It’s because WordPress is kindly converting 8 plus bracket to an icon of a smiley with sunglasses. 8) )
9) Use the appropriate punctuation.
10) Spell out acronyms, at least once at the beginning. Also tell your translators how to handle acronyms in the translated text.
11) Give your translators some contextual information. This affects the translation, because different terms mean different things depending on the industry or other context.
12) Send only signed-off documents to the translators. Otherwise you’ll end up paying for re-translation.
13) Create a working communication channel with your translator. They will have lots of questions to ask. This is important for a good-quality translation.
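As an aside on tip 3, a glossary of terms to use and terms to avoid lends itself to an automated check before a document goes to the translator. Here’s a minimal sketch in Python. The glossary entries, the sample text and the function name find_discouraged_terms are all made up for illustration. They don’t come from Sarah’s session or from any particular tool.

```python
import re

# Hypothetical glossary (an assumption): term to avoid -> preferred term.
GLOSSARY = {
    "e-mail": "email message",
    "log-in": "log in",
}


def find_discouraged_terms(text, glossary):
    """Return (line number, term to avoid, preferred term) for each hit."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for avoid, prefer in glossary.items():
            # Match the term only as a whole word, ignoring case.
            pattern = r"(?<!\w)" + re.escape(avoid) + r"(?!\w)"
            if re.search(pattern, line, flags=re.IGNORECASE):
                findings.append((line_number, avoid, prefer))
    return findings


# Hypothetical document text.
document = "Send an E-mail to the administrator.\nThen log in again."

for line_number, avoid, prefer in find_discouraged_terms(document, GLOSSARY):
    print(f'Line {line_number}: use "{prefer}" instead of "{avoid}"')
```

A check like this only catches the terms you’ve listed. It can’t tell you whether the terms you do use are consistent or clear.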
Sarah recommends a book: The Guide to Translation and Localization – Communicating with the Global Marketplace. You can request it from LingoSystems. They will send it free of charge.
This was a very useful session. At Atlassian, where I work, we don’t yet optimise our documentation for translation, but it’s something we’re going to need to do very soon. It’s also something I’m interested in personally. Thank you for all the information and tips, Sarah.
With possible titles of “why consistency works”, or “what Homer knew about technical writing”, or even “what Homer, technical communication and Eminem have in common”, this post is just a musing. And perhaps amusing, in a gentle way. If it provokes any thoughts or ideas in your head, I’d love to hear them!
Homer and other epic poets use a number of repeated phrases. Famous ones are:
- wine-dark sea
- the wily Odysseus
- rosy-fingered dawn
- gray-eyed Athene
- I’ma make a new plan*
* We could argue that rappers use them too. I suspect there’s a whole blog post right there!
These repetitions are known as “stock phrases” or “epic formulae”. The usual explanation is that such phrases are a trick to make it easier to fill a gap in a line of poetry, to keep to the metre when you’re thinking on your feet.
I think there’s a less prosaic purpose too.
In a single familiar phrase, a writer can call up all the connotations and associations that the reader has accrued over the years of exposure to a shared culture. With just two or three words, you can evoke a flare of imagination, a memory, a delightful frisson of fear or just some previously-learned facts.
Ah, the power! Being able to make your reader remember something complex in such a simple way. Efficient, concise, unobtrusive, pleasurable and rewarding.
We do the same with consistency of terminology in technical writing. Ba da boom.
Off topic: I’m fond of trees. I took this picture when walking in the bush this morning.