Every now and then, and perhaps particularly so when working on a wiki, we technical writers need to manipulate our content in some way that’s not provided by our content management system. A few times recently, I’ve dabbled with Python to solve some problems. Do you often find the need to wrangle your content outside your CMS, and do you use Python or another scripting tool?
Python is a scripting language. It’s easy to learn, especially if you’ve done some programming in other languages. It’s just the ticket for data manipulation. It also offers a number of useful libraries. For example:
- There are various libraries that you can use to access a web application via a SOAP or an XML-RPC remote API. I use the “xmlrpc.client” library in a few scripts, to get access to Confluence data.
- The “os” library is useful for creating directories on the local file system of the computer you’re running on. For example, I use it to create a directory for the script’s output file.
- The “re” library offers regular expression functions.
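Here's a minimal sketch of the "os" and "re" libraries in action. The function names, the default directory name "output" and the whitespace-tidying rule are made up for illustration; they aren't taken from any of the scripts described in this post.

```python
import os
import re


def make_output_dir(parent, name="output"):
    """Create the directory parent/name if needed and return its path."""
    path = os.path.join(parent, name)
    # exist_ok=True means no error is raised if the directory already exists.
    os.makedirs(path, exist_ok=True)
    return path


def tidy_page_name(raw):
    """Collapse runs of whitespace in a page name to single spaces."""
    return re.sub(r"\s+", " ", raw).strip()
```

For example, `tidy_page_name("  Getting   started \n")` returns `"Getting started"`.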
A script to find duplicate page names across Confluence spaces
This was the first Python script that I wrote to wrangle Confluence data. I started with a specific problem: I had five text files, each containing a list of page names. These were the pages in five Confluence spaces that we needed to copy into another, single space. The problem is that Confluence does not allow duplicate page names within a space. So I needed to check my lists for matching page names.
I hacked together a Python script that checked for duplicate page names. The script reads a text file containing Confluence space keys and page names, and reports on duplicate page names. My first script used nested lists to store and compare the page names. A kind Atlassian developer reviewed the script and suggested I use a dictionary instead. So I did. A dictionary stores data in key-value pairs. Much neater!
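To show the dictionary idea, here's a rough sketch rather than the actual script. It assumes each input line holds a space key and a page name separated by a tab, which is a guess at the file format, and the function name is hypothetical.

```python
from collections import defaultdict


def find_duplicates(lines):
    """Return {page name: [space keys]} for names listed more than once."""
    spaces_by_name = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumed line format: SPACEKEY<TAB>Page name
        space_key, _, page_name = line.partition("\t")
        if not page_name:
            continue  # skip lines that don't match the assumed format
        spaces_by_name[page_name.strip()].append(space_key.strip())
    # Keep only the page names that turned up more than once.
    return {name: keys for name, keys in spaces_by_name.items() if len(keys) > 1}
```

The page name is the dictionary key, so each new line needs just one lookup instead of a comparison against every name already seen. With a file on disk you could call it as `find_duplicates(open("pages.txt", encoding="utf-8"))`, where "pages.txt" is a placeholder name.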
Then I thought: Some people may not have their page names in a handy text file. They may want to get a list of all pages in a Confluence space. So I wrote a script to get the names of all pages in a given set of Confluence spaces.
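A script like that might look something like the sketch below. It assumes the Confluence XML-RPC endpoint at "/rpc/xmlrpc" and the "confluence2" methods "login" and "getPages", so check them against the remote API documentation for your Confluence version. The function names and the shape of the result are my own invention for illustration.

```python
import xmlrpc.client


def connect(base_url, username, password):
    """Log in to Confluence and return (server proxy, login token)."""
    # Assumption: the XML-RPC endpoint lives at <base_url>/rpc/xmlrpc.
    server = xmlrpc.client.ServerProxy(base_url.rstrip("/") + "/rpc/xmlrpc")
    token = server.confluence2.login(username, password)
    return server, token


def get_page_titles(server, token, space_keys):
    """Return {space key: [page titles]} for the given spaces."""
    titles = {}
    for key in space_keys:
        # Assumption: getPages returns a list of page summaries,
        # each a dictionary with a "title" entry.
        pages = server.confluence2.getPages(token, key)
        titles[key] = [page["title"] for page in pages]
    return titles
```

Usage would be along the lines of `server, token = connect("https://wiki.example.com", "me", "secret")` followed by `get_page_titles(server, token, ["DOC", "DEV"])`, with your own URL, credentials and space keys in place of these placeholders.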
The details of the scripts are in this post: How to find duplicate page names across Confluence spaces.
A script to get the source code of all pages in a Confluence space, for a full-text search
The search functionality in the Confluence web interface will return results from the visible content of the page, but it cannot get inside the XML-like elements that make up the Confluence storage format. For example, it’s not possible to find all pages that reference a certain image. And you can’t search for macro parameter values. This means, for example, you can’t search for all pages that include content from a given page.
Just recently I wrote a script that gets the XML storage format of all pages in a given Confluence space, and puts the code into text files on your local machine. Then you can use a powerful full-text search tool like grep to find what you need. The details are in this post: Confluence full-text search using Python and grep.
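The core of such a script can be sketched as follows. This is not the script from the linked post. It assumes an XML-RPC server proxy and login token are already in hand, and that the "confluence2" methods "getPages" and "getPage" return dictionaries with "id", "title" and "content" entries. The function names and the file naming scheme are hypothetical.

```python
import os
import re


def safe_filename(title, page_id):
    """Build a file name from the page ID and a tidied-up page title."""
    # Replace anything that might upset the file system with an underscore.
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", title).strip("_")
    # The page ID keeps names unique even if two titles clean up the same way.
    return f"{page_id}_{cleaned}.txt"


def save_space_source(server, token, space_key, output_dir):
    """Write the storage format of each page in a space to its own text file."""
    os.makedirs(output_dir, exist_ok=True)
    written = []
    for summary in server.confluence2.getPages(token, space_key):
        # Assumption: getPage returns the full page, with the storage
        # format in its "content" entry.
        page = server.confluence2.getPage(token, summary["id"])
        path = os.path.join(output_dir, safe_filename(page["title"], summary["id"]))
        with open(path, "w", encoding="utf-8") as output_file:
            output_file.write(page["content"])
        written.append(path)
    return written
```

Once the files are on disk, a command such as `grep -l "diagram.png" output/*.txt` lists every file that mentions that image, where "diagram.png" and the "output" directory are placeholders.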
More on the way
I’m currently writing a couple more Python scripts to solve another problem. I’ll blog about it when I’ve finished.
If you’re interested in Python, here are some links you may find useful:
A chuckle, courtesy of the Python technical writers
From the Python documentation:
By the way, the language is named after the BBC show “Monty Python’s Flying Circus” and has nothing to do with reptiles. Making references to Monty Python skits in documentation is not only allowed, it is encouraged!
Probably not a python
Last week I was lucky enough to be in New Orleans in the USA. I went on a tour of the Honey Island swamp, and saw this snake coiled comfortably on a tree trunk. I’m not sure what type of snake it is. Maybe a Copperhead:
What do you use?
Do you often use Python or some other scripting tool to automate those pesky tasks your CMS can’t handle?
STC Summit 2013 is fast approaching. I’m looking forward to getting the latest gen on all things #techcomm, meeting old friends, and making new acquaintances. I’ll also be giving a presentation on doc sprints!
Update on Wednesday 8 May 2013: The report on the actual presentation is now available: http://ffeathers.wordpress.com/2013/05/08/doc-sprints-at-stc-summit-2013-the-presentation/
A doc sprint is similar to a book sprint. It’s an event where a group of people get together for a couple of days and write tutorials, or a book, or other forms of documentation. Often there’s coding involved too. And always, plenty of fun, making new contacts, and learning cool new technologies.
Doc Sprints: The Ultimate in Collaborative Document Development
My presentation is called Doc Sprints: The Ultimate in Collaborative Document Development. It’s full of information about planning and running a doc sprint, and how doc sprints are useful in developing the documentation our readers need.
Even more exciting: there are a number of stories and tips, gleaned from doc sprinters around the world. Thanks to Anne Gentle, Swapnil Ogale, Ellis Pratt, Katya Stepalina, Andreas Spall, Jay Meissner, and Peter Lubbers, for contributing their ideas!
The presentation covers these topics:
- Introduction to doc sprints, agile environments, and why a doc sprint is a good fit for technical documentation.
- Who to invite, when to start, and how to ensure that the sprint will produce the documents you need.
- How to get the best out of the sprinters.
- Collaborative tools for use during the sprint.
- Sprinting across the world: Handling multiple time zones, early sprinters, late sprinters.
- How to run a retrospective, and why.
- Reviewing and publishing the documents, and writing up the results.
- Other innovative types of sprints for documentation teams.
Here’s what the presentation looked like a few weeks ago:
Come to my session at STC Summit 2013 to see how it’s turned out.