Python as a useful tool for technical writers

Every now and then, and perhaps particularly so when working on a wiki, we technical writers need to manipulate our content in some way that’s not provided by our content management system. A few times recently, I’ve dabbled with Python to solve some problems. Do you often find the need to wrangle your content outside your CMS, and do you use Python or another scripting tool?

Python is a scripting language. It’s easy to learn, especially if you’ve done some programming in other languages. It’s just the ticket for data manipulation. It also offers a number of useful libraries. For example:

  • There are various libraries that you can use to access a web application via a SOAP or an XML-RPC remote API. I use the “xmlrpc.client” library in a few scripts, to get access to Confluence data.
  • The “os” library is useful for creating directories on the local file system of the computer you’re running on. For example, I use it to create a directory for the script’s output file.
  • The “re” library offers regular expression functions.
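To give a feel for the “os” and “re” libraries, here’s a minimal sketch that creates an output directory and builds a safe file name from a page title. The directory name, the sample title and the helper name `safe_file_name` are made up for illustration.

```python
import os
import re
import tempfile


def safe_file_name(page_title, page_id):
    """Strip punctuation and underscores from a title, then append the page ID."""
    cleaned = re.sub(r"([^\s\w]|_)+", "", page_title)
    return cleaned + "-" + page_id


# Create the output directory if it does not exist yet.
# (The directory name is a hypothetical example.)
output_path = os.path.join(tempfile.gettempdir(), "confluence_output")
os.makedirs(output_path, exist_ok=True)

file_name = safe_file_name("Release notes: v2.0 (draft)", "12345")
print(os.path.join(output_path, file_name))
```

Passing `exist_ok=True` means the script can be run repeatedly without failing when the directory is already there.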

A script to find duplicate page names across Confluence spaces

This was the first Python script that I wrote to wrangle Confluence data. I started with a specific problem: I had five text files, each containing a list of page names. These were the pages in five Confluence spaces that we needed to copy into a single space. The problem is that Confluence does not allow duplicate page names within a space, so I needed to check my lists for matching page names.

I hacked together a Python script that checked for duplicate page names. The script reads a text file containing Confluence space keys and page names, and reports on duplicate page names. My first script used nested lists to store and compare the page names. A kind Atlassian developer reviewed the script and suggested I use a dictionary instead. So I did. A dictionary stores data in key-value pairs. Much neater!
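The dictionary approach can be sketched roughly like this. It’s an illustration rather than the actual script: the input format (one “SPACEKEY,Page name” pair per line) and the function name `find_duplicates` are assumptions.

```python
from collections import defaultdict


def find_duplicates(lines):
    """Map each page name to the space keys it appears in.

    Returns only the page names that occur in more than one space.
    """
    spaces_by_page = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumed format: space key, comma, page name.
        space_key, page_name = line.split(",", 1)
        spaces_by_page[page_name.strip()].append(space_key.strip())
    return {name: keys for name, keys in spaces_by_page.items() if len(keys) > 1}


sample = [
    "DOC,Getting Started",
    "DOC,Release Notes",
    "DEV,Getting Started",
    "OPS,Installation Guide",
]
for name, keys in find_duplicates(sample).items():
    print(name, "appears in:", ", ".join(keys))
```

Because the page name is the dictionary key, each new line needs only one lookup, instead of a comparison against every name seen so far as with nested lists.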

Then I thought: Some people may not have their page names in a handy text file. They may want to get a list of all pages in a Confluence space. So I wrote a script to get the names of all pages in a given set of Confluence spaces.
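Here’s a rough sketch of how fetching the page names might look with “xmlrpc.client”. It assumes your Confluence version still exposes the XML-RPC remote API with the `login` and `getPages` methods. The site URL, user name and password are placeholders, and the function names are hypothetical.

```python
import xmlrpc.client


def get_page_titles(server, token, space_keys):
    """Return a dictionary mapping each space key to a list of its page titles."""
    titles = {}
    for space_key in space_keys:
        # getPages returns one summary (a dictionary) per page in the space.
        pages = server.confluence2.getPages(token, space_key)
        titles[space_key] = [page["title"] for page in pages]
    return titles


def connect(site_url, username, password):
    """Log in to Confluence and return the server proxy and session token."""
    server = xmlrpc.client.ServerProxy(site_url + "/rpc/xmlrpc")
    token = server.confluence2.login(username, password)
    return server, token


# Example usage (placeholder values, so it is left commented out):
# server, token = connect("https://wiki.example.com", "username", "password")
# print(get_page_titles(server, token, ["DOC", "DEV"]))
```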

The details of the scripts are in this post: How to find duplicate page names across Confluence spaces.

A script to get the source code of all pages in a Confluence space, for a full-text search

The search functionality in the Confluence web interface will return results from the visible content of the page, but it cannot get inside the XML-like elements that make up the Confluence storage format. For example, it’s not possible to find all pages that reference a certain image. And you can’t search for macro parameter values. This means, for example, you can’t search for all pages that include content from a given page.

Just recently I wrote a script that gets the XML storage format of all pages in a given Confluence space, and puts the code into text files on your local machine. Then you can use a powerful full-text search like grep, to find what you need. The details are in this post: Confluence full-text search using Python and grep
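If grep isn’t available on your machine, a few lines of Python can do a similar plain-text search over the saved files. This is a minimal sketch, not part of the original scripts: the function name, the file names and the sample storage-format snippets are made up for the demonstration.

```python
import os
import tempfile


def find_files_containing(directory, search_text):
    """Return the names of files in the directory whose text contains search_text."""
    matches = []
    for file_name in sorted(os.listdir(directory)):
        file_path = os.path.join(directory, file_name)
        if not os.path.isfile(file_path):
            continue
        with open(file_path, encoding="utf-8", errors="replace") as page_file:
            if search_text in page_file.read():
                matches.append(file_name)
    return matches


# Demonstration with two made-up page files in a temporary directory.
with tempfile.TemporaryDirectory() as demo_dir:
    samples = {
        "Home-111": '<ac:image><ri:attachment ri:filename="logo.png"/></ac:image>',
        "About-222": "<p>No images here.</p>",
    }
    for name, content in samples.items():
        with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as sample_file:
            sample_file.write(content)
    demo_matches = find_files_containing(demo_dir, 'ri:filename="logo.png"')
    print(demo_matches)
```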

More on the way

I’m currently writing a couple more Python scripts to solve another problem. I’ll blog about it when I’ve finished.

Resources

If you’re interested in Python, here are some links you may find useful:

A chuckle, courtesy of the Python technical writers

From the Python documentation:

By the way, the language is named after the BBC show “Monty Python’s Flying Circus” and has nothing to do with reptiles. Making references to Monty Python skits in documentation is not only allowed, it is encouraged!

Probably not a python

Last week I was lucky enough to be in New Orleans in the USA. I went on a tour of the Honey Island swamp, and saw this snake coiled comfortably on a tree trunk. I’m not sure what type of snake it is. Maybe a Copperhead:

What do you use?

Do you often use Python or some other scripting tool to automate those pesky tasks your CMS can’t handle?

About Sarah Maddox

Technical writer, author and blogger in Sydney

Posted on 15 May 2013, in Confluence, technical writing. 11 Comments.

  1. I’ve found Perl in conjunction with the Confluence CLI really useful. I wrote a script using this handy little Perl module (http://search.cpan.org/~diberri/HTML-WikiConverter-Confluence-0.01/lib/HTML/WikiConverter/Confluence.pm) that imported all of the content from our old support center to our new one in Confluence. It wasn’t perfect, but it saved us a lot of time!

    I’ve heard good things about Python though, and I’d really like to try it. I’ve heard it’s good to learn (to program) with – do you think that’s true?

    • Hallo Beth

      That’s a cool script. Do you have any plans to update it for the new Confluence storage format instead of wiki markup? ;)

      I haven’t used Perl. Python is satisfying to use. One of the cool things is how it handles code blocks, such as the bodies of if statements and loops. Instead of using curly brackets to delineate the block, you just indent all lines in the block by the same amount.

      Here’s an example of a for loop:


      import os
      import re

      # server, token, pages_list and output_path are assumed to be defined earlier in the script.
      for page in pages_list:
        # Get the content of the page
        page_content = server.confluence2.getPage(token, page["id"])
        # File name is equal to page name without special characters, plus page ID.
        # Use a regular expression (re) to strip non-alphanum characters from page name.
        page_name = page["title"]
        page_id = page["id"]
        page_name_qualified = re.sub(r'([^\s\w]|_)+', '', page_name) + "-" + page_id
        # Open the output file for writing. Will overwrite existing file.
        # File is in the required output directory.
        # The "with" block closes the file even if a write fails.
        with open(os.path.join(output_path, page_name_qualified), "w") as page_file:
          # Write a line containing the URL of the page
          page_file.write("**" + page["url"] + "**\n")
          # Write the page content to the file
          page_file.write(page_content["content"])
      print("All done! I've put the results in this directory:", output_path)

      Cheers
      Sarah

  2. I strongly encourage admins and power users to have some sort of scripting tool in their pockets. There are just too many interesting things to be done that are tedious to do manually

  3. Sarah, great that you’re suggesting Python for these kinds of tasks. I hope more tech writers try it out. As you say, it’s easy to pick up and the built-in libraries are very useful. It’s also very sensibly designed and readable – a lot of Python developers find it improves their productivity and is easier to maintain than some other languages.

    We do a lot with Python – some of the same kinds of tasks you describe, but also things that use its power as more than a scripting language. We automate much of our DITA code review, with an extensive and growing list of checks. We also produce GUI tools with Windows installers for other colleagues to use, so they don’t need to install Python or touch a command line!

    The resources you listed are very good, and I’d like to mention another that’s especially useful for people who have some previous coding experience and want to get up to speed with idiomatic Python. It’s Mark Pilgrim’s book “Dive Into Python 3″, the full text of which is here:

    http://getpython3.com/diveintopython3/

    When people research Python they’ll soon come across the Python 2/3 debate. I would suggest using Python 3, unless the task mandates the use of external libraries that don’t support it yet. There are lots of benefits to Python 3, including much easier handling of Unicode. Many of the popular external libraries now support it, including one that’s pretty much a must for any serious XML handling: LXML (http://lxml.de/).

    • Hallo Joe

      Thanks for the useful resources!

      For other people reading these comments: It’s worth noting that there are significant syntactical differences between Python 2 and Python 3. Code written for one version often needs changes before it will run in the other. So choose wisely. :)

      BTW Joe, I love your blog. A very clean look. Your post about using Google+ comments is very interesting.

      Cheers
      Sarah

      • Sarah, good point. It can certainly be a chore to convert 2 to 3 or vice versa, though there are various tools and ways of working that can make it a bit easier.

        Thanks for the comments about my blog! Glad you like the look — I was going for clean and simple but hoped I hadn’t made it *too* spartan.

  4. Matthew Morgal

    Thanks for the links! I’ve recently started toying with programming beyond simple HTML/CSS, especially MySQL. Python is also on my list. Even for writers, the importance of diving into technology cannot be emphasized enough!

  5. If you want to get serious in Python, I recommend the PyCharm IDE by JetBrains. Great debugging facilities (stepping through code, examining variables, etc), good integration with version control systems, good control of appearance/layout. Better than Wing IDE and the free IDEs that are out there.

  1. Pingback: How to manage attachment usage in Confluence wiki with some Python scripts | ffeathers
