How to manage attachment usage in Confluence wiki with some Python scripts

Do you need to find out whether the attachments on a Confluence wiki page are used anywhere in the space? Having discovered they’re not, do you want to delete them from the page? I’m hoping this post will help.

The Confluence user interface doesn’t offer the option to delete attachments in bulk. Nor does it offer any way of cross-referencing attachment usage. You can’t get a list of attachments and find out where they’re used. So, I’ve written four Python scripts that you can run consecutively to do the following:

  • Get a list of all attachments on a given page.
  • Get the content of all pages in a given space.
  • Produce two reports, one listing the attachments that are not referenced anywhere in the space, and the second showing the attachments that are referenced and the pages that use them.
  • Accept a list of attachment names and delete them from a given page.

Our use case

In the Confluence documentation we have a page called Space Attachments Directory. It’s been there for yonks. It has an enormous number of screenshots attached to it (396, to be precise). The page was created in 2005, with the aim of storing screenshots that can be re-used on various pages. A good aim in principle, but in practice unmanageable when applied across a large space maintained by many authors. Various technical writers over the years have either used or not used this page and its attachments.

As a result, we didn’t know how many of the attachments are actually used anywhere in the space. I suspected that only a few of the attachments were still in use.

Python to the rescue.

The scripts

The four Python scripts are available on Bitbucket. Please feel free to download and use them. If you have any suggestions for improvement, I’d love to hear them.

A friendly warning: These scripts are provided “as is” and without any guarantees. I developed them to solve a specific problem. I’m sharing them because I hope they will be useful to others too. If you have any improvements to share, please let me know.

1. getConfluencePageAttachments.py: Gets all attachments on a given Confluence page. It puts the list of attachments into a text file, and prints a report of the number of attachments and total file size.

2. getConfluencePageContent.py: Gets the content of all pages in a given Confluence space. It puts the content of each page into a separate text file, in a given directory. The content is in the form of the Confluence “storage format”, which is a type of XML consisting of HTML with Confluence-specific elements. A note for the curious: The “wherePageContent.py” script is a dummy, which simply tells you where to find getConfluencePageContent.py, which I wrote for a different purpose and which works well here too. (We need content re-use on Bitbucket!)

3. findAttachmentUsage.py: Reads a text file containing attachment file names, matches them against the source of Confluence pages, and produces a report on used and unused attachments.

4. deleteAttachments.py: Reads a text file containing attachment file names, accepts a Confluence page name, and removes the given attachments from the page.

Note: To run scripts 1, 2 and 4 successfully, you need access to Confluence, and the Confluence remote API must be enabled. Script 3 does all its work in text files. It’s like greased lightning.  🙂

So, in my use case, how many of the attachments are actually used?

71

That’s right. Of the 396 attachments on the “space attachments directory” page, only 71 are still in use. The other 325 are taking up space on our documentation wiki, taking up space in our XML exports, and slowing down our processes when we copy the Confluence documentation to the OnDemand space.

What’s next?

After some final testing, I’ll run the scripts on our production wiki next week. The first candidate is the Space Attachments Directory page. We’ll look at other pages that have a large number of attachments too.

The findAttachmentUsage.py script produces a cross-referenced list of matched attachments and the pages that reference them. We may use that cross-reference to decide whether we want to retain the “space attachments directory”. We may decide instead to move all the attachments to the pages where they’re used, and remove the shared page.

How to run the Python scripts

New to Python? It’s fun, and remarkably easy. This earlier post describes how to download and use Python: Confluence full-text search using Python and grep. There’s more about Python, and some interesting comments from readers, on this post: Python as a useful tool for technical writers.

About Sarah Maddox

Technical writer, author and blogger in Sydney

Posted on 2 June 2013, in Confluence, technical writing and tagged , , , , , . Bookmark the permalink. 6 Comments.

  1. Awesome tip, thank you Ms Maddox!

  2. Thank you very much for your scripts – one of our spaces is much lighter since I hacked the scripts to process a whole space at once and deleted lots and lots of unused attachments!

  3. Thank you for your awesome scripts!

    Just want to say, that in new versions of Confluence variable “search_string” in findAttachmentUsage.py should be changed from “search_string = ‘<ri:page ri:content-title="' + input_pagename" to
    "search_string = '’.format(attachment_name)”

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.