Blog Archives

WtD Prague: Localisation of open source docs

This week I’m attending Write the Docs Prague. It’s super exciting to attend a European Write the Docs conference, and to be visiting the lovely city of Prague. This post contains my notes from a talk at the conference. All credit goes to the presenter, and any mistakes are my own.

Zachary Sarah Corleissen’s talk was titled “Found in Translation: Lessons from a Year of Open Source Localization”.

[From Sarah Maddox, author of this blog: Localisation is the process of translating content to different languages, and of adapting the content to take regional idioms and linguistic customs into account.]

Zach’s experience comes from localising the Kubernetes docs.

Advantages of localisation

Zach discussed the advantages of localising an open source project. Localisation opens doors to a wider audience. It’s a tool to drive adoption of a product. Localisation also offers the opportunity for more people to contribute new features to a product. It therefore distributes power within the open source project.

When the Kubernetes docs managers considered localising the docs, they made some assumptions that later proved to be unfounded. For instance, they thought the localisation contributors would contribute only to their own language. That proved not to be the case. Localisation contributors update the English source as well as their own language source, and they also help other localisation teams get started. For example, the French teams help other teams get started with localisation infrastructure, and groups of related languages get together to define grammatical structures for technology-specific terms, such as “le Pod”. Thus the localisation contributors embody the best of open source contributions.

Localised pages increase the number of page views, which is a good thing for a doc set. Zach showed us some stats from Google Analytics with some impressive numbers. Each language added around 1% to the page views, which represents a big number in a doc set as large as Kubernetes.

Zach said we should also consider the support ratio that the localised docs provide. For example, there are 8 localisation contributors for the Korean docs, catering for 55,187 readers. So, 8 : 55,187 is a ratio of roughly 1 : 6,900.
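
The arithmetic is easy to check. Here is a minimal sketch in Python, using the figures quoted above. The function name `support_ratio` is my own and was not part of the talk.

```python
def support_ratio(contributors: int, readers: int) -> float:
    """Return the number of readers served per contributor."""
    if contributors <= 0:
        raise ValueError("contributors must be positive")
    return readers / contributors

# Figures quoted in the talk for the Korean docs.
readers_per_contributor = support_ratio(8, 55_187)
print(f"1 : {readers_per_contributor:,.0f}")  # prints 1 : 6,898
```

The exact figure is 6,898 readers per contributor, which rounds to the 1 : 6,900 quoted in the talk.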


These are some of the nuggets of advice Zach shared:

  • Let each of the local teams determine for themselves how they create the localised content. That fits in best with open source philosophy, and the local teams know their local customs best.
  • The Kubernetes project does require that the localisation teams adhere to the Kubernetes code of conduct, and that the code of conduct is one of the first docs translated.
  • Bottlenecks include infrastructure, permissions, and filtering by language. You need to put solutions in place to manage these bottlenecks.
  • Trust is essential to collaboration.
  • To make a high level of mutual trust possible, make sure the boundaries are clear, and be careful with the permissions that you assign in the repository.
  • Choose a site generator that has strong multi-language support. A good one is Hugo. Jekyll makes things very difficult.
  • Filter the issues and pull requests by language. Zach doesn’t know of any good tools for this filtering. (If you know of any, shoot him a tweet.) Zach mentioned some possibilities from the Kubernetes world: Prow is a possibility but it’s a heavyweight tool for just localisation. Another option is zparnold’s language labeler.
  • Use version control to review and approve by language. Require review and approval from a user in a different company from the submitter of the pull request.
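
One way to approximate the filtering by language is to label each pull request based on the files it touches. The sketch below only illustrates the idea. It is not how Prow or zparnold’s language labeler works. It assumes that localised pages live under `content/<language code>/`, and the function name `language_labels` and the `language/xx` label format are hypothetical.

```python
from pathlib import PurePosixPath

def language_labels(changed_files: list[str], root: str = "content") -> set[str]:
    """Derive language labels from the paths a pull request changes.

    Assumes a layout of <root>/<language code>/..., which is an
    illustrative convention, not something stated in the talk.
    """
    labels = set()
    for name in changed_files:
        parts = PurePosixPath(name).parts
        # Only paths like content/ko/docs/page.md carry a language code.
        if len(parts) >= 3 and parts[0] == root:
            labels.add(f"language/{parts[1]}")
    return labels

# Hypothetical list of files changed in one pull request.
changed = [
    "content/ko/docs/concepts/overview.md",
    "content/fr/docs/home/_index.md",
    "README.md",
]
print(sorted(language_labels(changed)))
```

With labels like these in place, reviewers for each language could filter the pull request list down to the changes that concern them.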

Some cautionary tales:

  • Look out for raw machine-generated content.
  • Make sure the translators are not being exploited as free labour. Even if you’re not directly engaging the translators, take steps to ensure ethical content.


I learned a lot from this session. It was especially relevant as we’re starting to consider localisation of the Kubeflow docs, which I work on. Thank you, Zach, for a very informative session.

Intelligent content at stc17

This week I’m attending STC Summit 2017, the annual conference of the Society for Technical Communication. These are my notes from one of the sessions at the conference. All credit goes to the presenter, and any mistakes are mine.

Val Swisher presented a session called “The Holy Trifecta of Intelligent Technical Content”. The trifecta comprises structured intelligent technical content, terminology management, and translation memory. With these three, technical writers can efficiently produce content for multiple channels, for an international audience.

Val explained each of the three elements (structured content, source terminology management, and translation memory) and the magic that happens when you use them all together. Using the three together makes content development better, cheaper, and faster.

Structured authoring

Val walked us through the original content development process, where a writer wrote the content, then passed it off for translation and desktop publishing. This process was slow, expensive, and gave the writer little control.

In a structured environment, the author writes smaller chunks of content (sometimes called topics) and checks them into a CMS. The information product (PDF file, web page, book, etc) is a collection of these chunks in a certain order. In theory, you should be able to combine the chunks in different orders and arrangements for different content products.
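
To make the idea concrete, here is a minimal sketch of chunks being reused across two information products. The topic IDs, the sample text, and the `assemble` function are all hypothetical. A real CMS would store far more than a dictionary of strings.

```python
# Hypothetical topic store: each chunk is written once and keyed by an ID.
topics = {
    "install": "Install the product.",
    "configure": "Configure the product.",
    "troubleshoot": "Fix common problems.",
}

# Each information product is just an ordered list of topic IDs.
products = {
    "quick_start": ["install", "configure"],
    "admin_guide": ["install", "configure", "troubleshoot"],
}

def assemble(product: str) -> str:
    """Build one deliverable by joining its topics in the listed order."""
    return "\n\n".join(topics[topic_id] for topic_id in products[product])

print(assemble("quick_start"))
```

Both products pull the same “install” and “configure” chunks, so a change to either chunk shows up in every deliverable that uses it.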

Structured authoring should therefore produce more deliverables through content reuse, create consistency, and support multichannel publishing.

The content itself is separated from the eventual publication style and medium. Desktop publishing is a thing of the past.

Each individual chunk is independently translated. Each chunk is now in the database with its related translations.

There are a few problems to solve, terminology in particular. For example, what do you do to a button: Click, Click on, Tap, Select, Hit… We’re not consistent in our use of terminology in our source.

Source terminology

We need to manage our source terminology. People do it in various ways, such as via a document or style guide, via reviews (tribal knowledge), or via a specific tool.

Val emphasised the importance of picking one term for a particular thing or concept. For example, when talking about a dog, choose a word: dog, pooch, hound – it often doesn’t matter which term you pick, provided you’re consistent.

No-one reads style guides! Everyone wants to, because we all want to do a great job. But no-one has the time. Also, it’s hard to know whether the word you’re about to write is a managed term.

We need a way to manage the words we’re using and how we’re using them, that we don’t have to go and look for. The information must be pushed to us.

It’s almost better not to have structured authoring if you don’t manage your terminology. We split the topic development amongst a group of writers, which leads to greater problems with consistency. Val showed us a screenshot from an automated terminology tool, which allows you to define preferred terms, banned terms, etc, and then prompts the authors when they use a deprecated word.
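
As a rough illustration of how such a tool might prompt an author, here is a minimal sketch of a terminology checker. It is not the tool Val showed. The term list and the choice of preferred terms are my own assumptions, and `check_terms` is a hypothetical name.

```python
import re

# Hypothetical term list: deprecated term -> preferred term.
DEPRECATED_TERMS = {
    "click on": "click",
    "hit": "click",
    "pooch": "dog",
}

def check_terms(text: str) -> list[tuple[str, str]]:
    """Return (deprecated term, preferred term) pairs found in the text."""
    findings = []
    for deprecated, preferred in DEPRECATED_TERMS.items():
        # Match whole words or phrases only, ignoring case.
        pattern = r"\b" + re.escape(deprecated) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            findings.append((deprecated, preferred))
    return findings

for deprecated, preferred in check_terms("Click on Save, then hit Enter."):
    print(f'Use "{preferred}" instead of "{deprecated}".')
```

Running a check like this inside the editor is what pushes the information to the writer, instead of relying on the writer to look it up in a style guide.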

Translation memory

Val asked the audience whether we had translation memory (TM), whether our company owned the translation memory, whether we had more than one translation vendor, and whether those vendors shared the same memory. She stressed the importance of owning your own translation memory.

Translation memory (TM) is one of the automated tools that the translation vendor uses. If something in the source content has already been translated, the tool pops up the translation. This is because the translations are stored in a database called the translation memory. The bits of source content are stored as translation units, which are phrases, usually more than a word.

This makes translation cheaper. If you say the same thing in exactly the same way each time you say it, the tool pulls up the same translation as used the first time. This is called a 100% match. Note that a 100% match doesn’t cost zero dollars. To have no charge, you have to have an in-context match.
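
Here is a minimal sketch of how a translation memory lookup might distinguish a 100% match from a partial one. The stored segments, the 0.75 threshold, and the `lookup` function are assumptions for illustration, and real TM tools score matches in their own ways. The sketch does not model in-context matches.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: source segment -> stored French translation.
TRANSLATION_MEMORY = {
    "Click Save.": "Cliquez sur Enregistrer.",
    "Click Cancel.": "Cliquez sur Annuler.",
}

def lookup(segment: str, fuzzy_threshold: float = 0.75):
    """Return (match type, translation) for a source segment."""
    if segment in TRANSLATION_MEMORY:
        return "100% match", TRANSLATION_MEMORY[segment]
    # Otherwise find the most similar stored segment.
    best_source = max(
        TRANSLATION_MEMORY,
        key=lambda source: SequenceMatcher(None, segment, source).ratio(),
    )
    score = SequenceMatcher(None, segment, best_source).ratio()
    if score >= fuzzy_threshold:
        return "fuzzy match", TRANSLATION_MEMORY[best_source]
    return "no match", None

print(lookup("Click Save."))
print(lookup("Click on Save."))
print(lookup("Restart the server."))
```

Writing “Click on Save.” instead of “Click Save.” is enough to lose the 100% match, which is why consistent source terminology saves money.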

Val emphasised that you should make sure that what’s in the TM is pushed to the writers, although she knows of very few companies that are doing this. That way, writers would know what’s already been translated and be able to ensure we use the same terms when developing new content.

Ideally, there’d be an automated link from the translation memory to the terminology management system. But that’s complicated, and doesn’t happen often.

Tying them together

Val discussed the intersection of three technology areas:

  • Structured authoring – write it once, use it many times.
  • Terminology management – say the same thing the same way, every time you say it. Be as boring as you can and as simple as you can.
  • Translation memory – use already-translated terms in your source content.

This takes a lot of setup and maintenance. But it’s worth it.


Val’s presentation was funny, engaging, and informative. She had the audience nodding and laughing throughout the session. Thanks Val!
