Blog Archives

How to manage attachment usage in Confluence wiki with some Python scripts

Do you need to find out whether the attachments on a Confluence wiki page are used anywhere in the space? Having discovered they’re not, do you want to delete them from the page? I’m hoping this post will help.

The Confluence user interface doesn’t offer the option to delete attachments in bulk. Nor does it offer any way of cross-referencing attachment usage. You can’t get a list of attachments and find out where they’re used. So, I’ve written four Python scripts that you can run consecutively to do the following:

  • Get a list of all attachments on a given page.
  • Get the content of all pages in a given space.
  • Produce two reports, one listing the attachments that are not referenced anywhere in the space, and the second showing the attachments that are referenced and the pages that use them.
  • Accept a list of attachment names and delete them from a given page.

Our use case

In the Confluence documentation we have a page called Space Attachments Directory. It’s been there for yonks. It has an enormous number of screenshots attached to it (396, to be precise). The page was created in 2005, with the aim of storing screenshots that can be re-used on various pages. A good aim in principle, but in practice unmanageable when applied across a large space maintained by many authors. Various technical writers over the years have either used or not used this page and its attachments.

As a result, we didn’t know how many of the attachments are actually used anywhere in the space. I suspected that only a few of the attachments were still in use.

Python to the rescue.

The scripts

The four Python scripts are available on Bitbucket. Please feel free to download and use them. If you have any suggestions for improvement, I’d love to hear them.

A friendly warning: These scripts are provided “as is” and without any guarantees. I developed them to solve a specific problem. I’m sharing them because I hope they will be useful to others too.

1. getConfluencePageAttachments.py: Gets all attachments on a given Confluence page. It puts the list of attachments into a text file, and prints a report of the number of attachments and total file size.

2. getConfluencePageContent.py: Gets the content of all pages in a given Confluence space. It puts the content of each page into a separate text file, in a given directory. The content is in the form of the Confluence “storage format”, which is a type of XML consisting of HTML with Confluence-specific elements. A note for the curious: The “wherePageContent.py” script is a dummy, which simply tells you where to find getConfluencePageContent.py, which I wrote for a different purpose and which works well here too. (We need content re-use on Bitbucket!)

3. findAttachmentUsage.py: Reads a text file containing attachment file names, matches them against the source of Confluence pages, and produces a report on used and unused attachments.

4. deleteAttachments.py: Reads a text file containing attachment file names, accepts a Confluence page name, and removes the given attachments from the page.

Note: To run scripts 1, 2 and 4 successfully, you need access to Confluence, and the Confluence remote API must be enabled. Script 3 does all its work in text files. It’s like greased lightning. 🙂
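To give a feel for what scripts 1 and 4 involve, here is a minimal sketch of listing and deleting attachments over XML-RPC. It is not the code from the Bitbucket repository. The function names are hypothetical, and the remote method names (login, getPage, getAttachments, removeAttachment under the confluence2 namespace) are assumed from the Confluence XML-RPC remote API, so check them against your Confluence version before relying on them.

```python
import xmlrpc.client


def connect(base_url, username, password):
    """Log in and return (server proxy, session token)."""
    # Assumption: the XML-RPC endpoint lives at <base URL>/rpc/xmlrpc.
    server = xmlrpc.client.ServerProxy(base_url.rstrip("/") + "/rpc/xmlrpc")
    token = server.confluence2.login(username, password)
    return server, token


def list_attachments(server, token, space_key, page_title):
    """Return (page ID, attachment file names, total size in bytes) for one page."""
    page = server.confluence2.getPage(token, space_key, page_title)
    attachments = server.confluence2.getAttachments(token, page["id"])
    names = [a["fileName"] for a in attachments]
    # The size may arrive as a string, so convert before summing.
    total_bytes = sum(int(a["fileSize"]) for a in attachments)
    return page["id"], names, total_bytes


def delete_attachments(server, token, page_id, file_names):
    """Remove the named attachments from a page; return the names actually removed."""
    removed = []
    for name in file_names:
        if server.confluence2.removeAttachment(token, page_id, name):
            removed.append(name)
    return removed
```

Deleting is not reversible from a script like this, so it is worth trying it on a test page first.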

So, in my use case, how many of the attachments are actually used?

71

That’s right. Of the 396 attachments on the “space attachments directory” page, only 71 are still in use. The other 325 are taking up space on our documentation wiki, taking up space in our XML exports, and slowing down our processes when we copy the Confluence documentation to the OnDemand space.

What’s next?

After some final testing, I’ll run the scripts on our production wiki next week. The first candidate is the Space Attachments Directory page. We’ll look at other pages that have a large number of attachments too.

The findAttachmentUsage.py script produces a cross-referenced list of matched attachments and the pages that reference them. We may use that cross-reference to decide whether we want to retain the “space attachments directory”. We may decide instead to move all the attachments to the pages where they’re used, and remove the shared page.
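The cross-referencing idea can be sketched in a few lines. This is an illustration only, not the findAttachmentUsage.py script itself: the function name and the in-memory dictionary of page sources are assumptions made to keep the example short.

```python
def find_attachment_usage(attachment_names, pages):
    """Split attachment names into used (with referencing pages) and unused.

    pages maps a page name to that page's storage-format source.
    """
    used = {}
    unused = []
    for name in attachment_names:
        # Plain substring match against each page's source.
        referencing = sorted(title for title, source in pages.items() if name in source)
        if referencing:
            used[name] = referencing
        else:
            unused.append(name)
    return used, unused
```

Because this sketch uses a plain substring match, a name such as logo.png would also match big-logo.png. A stricter match on the whole file name would avoid that.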

How to run the Python scripts

New to Python? It’s fun, and remarkably easy. This earlier post describes how to download and use Python: Confluence full-text search using Python and grep. There’s more about Python, and some interesting comments from readers, on this post: Python as a useful tool for technical writers.

Python as a useful tool for technical writers

Every now and then, and perhaps particularly so when working on a wiki, we technical writers need to manipulate our content in some way that’s not provided by our content management system. A few times recently, I’ve dabbled with Python to solve some problems. Do you often find the need to wrangle your content outside your CMS, and do you use Python or another scripting tool?

Python is a scripting language. It’s easy to learn, especially if you’ve done some programming in other languages. It’s just the ticket for data manipulation. It also offers a number of useful libraries. For example:

  • There are various libraries that you can use to access a web application via a SOAP or an XML-RPC remote API. I use the “xmlrpc.client” library in a few scripts, to get access to Confluence data.
  • The “os” library is useful for creating directories on the local file system of the computer you’re running on. For example, I use it to create a directory for the script’s output file.
  • The “re” library offers regular expression functions.
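As a small taste of the “os” and “re” libraries, here is a hedged sketch. The function names are made up for illustration. The regular expression assumes that attachment references in the Confluence storage format carry an ri:filename attribute.

```python
import os
import re

# Assumption: attachment references look like ri:filename="name.png".
FILENAME_PATTERN = re.compile(r'ri:filename="([^"]+)"')


def referenced_files(storage_xml):
    """Return the attachment file names referenced in a page's source."""
    return FILENAME_PATTERN.findall(storage_xml)


def ensure_output_dir(path):
    """Create the output directory (and any parents) if it is missing."""
    os.makedirs(path, exist_ok=True)
    return os.path.abspath(path)
```

Passing exist_ok=True makes the directory step safe to repeat. Leave it out if you would rather get an error when the directory is already there.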

A script to find duplicate page names across Confluence spaces

This was the first Python script that I wrote to wrangle Confluence data. I started with a specific problem: I had five text files, each containing a list of page names. These were the pages in five Confluence spaces that we needed to copy into another, single space. The problem is that Confluence does not allow duplicate page names within a space. So I needed to check my lists for matching page names.

I hacked together a Python script that checked for duplicate page names. The script reads a text file containing Confluence space keys and page names, and reports on duplicate page names. My first script used nested lists to store and compare the page names. A kind Atlassian developer reviewed the script and suggested I use a dictionary instead. So I did. A dictionary stores data in key-value pairs. Much neater!
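To show why a dictionary is neater for this job, here is a minimal sketch. It is not the pageduptest script. The function name and the list-of-pairs input are assumptions for the example.

```python
from collections import defaultdict


def find_duplicates(pages):
    """pages is an iterable of (space_key, page_name) pairs.

    Returns {page_name: [space keys]} for names found in more than one space.
    """
    spaces_by_name = defaultdict(list)
    for space_key, page_name in pages:
        # The page name is the dictionary key, so a lookup replaces a nested loop.
        if space_key not in spaces_by_name[page_name]:
            spaces_by_name[page_name].append(space_key)
    return {name: keys for name, keys in spaces_by_name.items() if len(keys) > 1}
```

This sketch compares names exactly as given. A case-insensitive check would convert each name to one case before using it as the key.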

Then I thought: Some people may not have their page names in a handy text file. They may want to get a list of all pages in a Confluence space. So I wrote a script to get the names of all pages in a given set of Confluence spaces.

The details of the scripts are in this post: How to find duplicate page names across Confluence spaces.

A script to get the source code of all pages in a Confluence space, for a full-text search

The search functionality in the Confluence web interface will return results from the visible content of the page, but it cannot get inside the XML-like elements that make up the Confluence storage format. For example, it’s not possible to find all pages that reference a certain image. And you can’t search for macro parameter values. This means, for example, you can’t search for all pages that include content from a given page.

Just recently I wrote a script that gets the XML storage format of all pages in a given Confluence space, and puts the code into text files on your local machine. Then you can use a powerful full-text search like grep, to find what you need. The details are in this post: Confluence full-text search using Python and grep.

More on the way

I’m currently writing a couple more Python scripts to solve another problem. I’ll blog about it when I’ve finished.

Resources

If you’re interested in Python, here are some links you may find useful:

A chuckle, courtesy of the Python technical writers

From the Python documentation:

By the way, the language is named after the BBC show “Monty Python’s Flying Circus” and has nothing to do with reptiles. Making references to Monty Python skits in documentation is not only allowed, it is encouraged!

Probably not a python

Last week I was lucky enough to be in New Orleans in the USA. I went on a tour of the Honey Island swamp, and saw this snake coiled comfortably on a tree trunk. I’m not sure what type of snake it is. Maybe a Copperhead:

Python for technical writers

What do you use?

Do you often use Python or some other scripting tool to automate those pesky tasks your CMS can’t handle?

Confluence full-text search using Python and grep

The standard search in Confluence wiki searches the visible content of the page. It also offers keywords for some specific searches, such as macro names and page titles. But sometimes we need to find things that the search cannot find, because the content of the relevant XML elements is not indexed. This post offers a solution of sorts: Copy the XML storage format of your pages into text files on your local machine, then use a powerful search like grep to do the work.

Here are some examples of the problem:

  • We may want to find all pages that reference a certain image, or other attachment. It’s easy enough to find the page(s) where the image is attached. But it’s not possible to find all pages that display a given image which is attached to another page.
  • It’s possible to search for all occurrences of a macro name, using the macroName: keyword in the search. But it’s not possible to search for parameter values. This means, for example, you can’t search for all pages that include content from a given page.

I’ve written a script to solve the problem, by downloading the storage format from Confluence onto your local machine, where you can use all sorts of powerful text searches. You’re welcome to use the script, with the proviso that it’s not perfect.

Python script: getConfluencePageContent

The script is in a repository on Bitbucket: https://bitbucket.org/sarahmaddox/confluence-full-text-search.

Note: To run the script successfully, you need access to Confluence, and the Confluence remote API must be enabled.
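For the curious, this is roughly what talking to the remote API looks like. It is a minimal sketch and not the getConfluencePageContent script itself. The function names are hypothetical, and the remote method names (login, getPages and getPage in the confluence2 namespace) are assumed from the Confluence XML-RPC remote API.

```python
import xmlrpc.client


def fetch_space_content(server, token, space_key):
    """Yield (title, page ID, URL, storage-format content) for each page in a space."""
    for summary in server.confluence2.getPages(token, space_key):
        # getPages returns summaries only, so fetch each page to get its content.
        page = server.confluence2.getPage(token, summary["id"])
        yield page["title"], page["id"], page["url"], page["content"]


def main():
    """Prompt for connection details and print one line per page."""
    base_url = input("Confluence URL: ").rstrip("/")
    username = input("Username: ")
    password = input("Password: ")
    space_key = input("Space key: ")
    # Assumption: the XML-RPC endpoint lives at <base URL>/rpc/xmlrpc.
    server = xmlrpc.client.ServerProxy(base_url + "/rpc/xmlrpc")
    token = server.confluence2.login(username, password)
    for title, page_id, url, content in fetch_space_content(server, token, space_key):
        print(page_id, title, len(content))
```

Call main() from the Python shell to try it against your own site. Keeping the fetching logic in its own function means you can test it with a stand-in server object, without touching a real wiki.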

Installing Python

To run the script, you need to install Python. The scripts are designed for Python 3, not Python 2. There were fairly significant changes in Python 3.

  1. Download Python 3.2.3 or later: http://www.python.org/getit/
    (I downloaded python-3.2.3.amd64.msi, because I’m working on a 64-bit Windows machine.)
  2. Run the installer to install Python on your computer.
    (I left all the options at their default values.)
  3. Add the location of your Python installation to your path variable in Windows:
    1. Go to ‘Start’ > ‘Control Panel’ > ‘System’ > ‘Advanced system settings’
    2. Click ‘Environment Variables’.
    3. In the ‘System variables’ section, select ‘Path’.
    4. Click ‘Edit’.
    5. Add the following to the end of the path, assuming that you installed Python in the default location:
      ;C:\Python32
    6. Click ‘OK’ three times.
    7. Open a command window and type ‘python’ to see if all is OK. You should see the Python version information, followed by the ‘>>>’ prompt.

Getting the script

Go to the Bitbucket repository and choose ‘Downloads’ > ‘Branches’, then download the zip file and unzip it into a directory on your computer.

Running the script to get the content of your pages

To use the getConfluencePageContent script:

  1. Enable the remote API (XML-RPC & SOAP) on your Confluence site.
  2. Open the getConfluencePageContent script in Python’s ‘IDLE’ GUI. (Right-click on the script and choose ‘Edit with IDLE’.)
  3. Run the script from within IDLE. (Press F5.)
  4. The Python shell will open and prompt you for some information:
    • Confluence URL – The base URL of your Confluence site. If the site uses SSL, enter ‘HTTPS’ instead of ‘HTTP’. For example: https://my.confluence.com
    • Username – Confluence will use this username to access the pages. This username must have ‘view’ access to all the spaces and pages that you want to check.
    • Password – The password for the above username.
    • Space key – A Confluence space key. Case is not important – the match is not case-sensitive.
    • Output directory name – The directory where the script should put its results. The script will create this directory. Make sure it does not yet exist.
  5. Look for the output directory as a sibling of the directory that contains the getConfluencePageContent script. In other words, the output directory will appear in your file system at the same level as the script’s directory.

Python shell (IDLE)

 

Output of the script

The Bitbucket repository contains an example of the output, based on the Demonstration space shipped with Confluence. See the outputexample directory in the repository. For example, this file contains the content of the page titled ‘Welcome to Confluence’.

The script gets the content of all pages in the given Confluence space. It puts the content of each page into a separate text file, in a given directory.

The script creates the output directory as a sibling of the directory that contains the getConfluencePageContent script. In other words, the output directory will appear in your file system at the same level as the script’s directory.

The file name is a combination of the page name and page ID. To prevent problems when creating the files, the script removes all non-alphanumeric characters from the file name. To ensure uniqueness, it appends the page ID to the page name when creating the file name.

The content is in the form of the Confluence storage format, which is a type of XML consisting of HTML with Confluence-specific elements. (Docs.)

The script also writes a line at the top of each file, containing the URL of the page, and marked with asterisks for easy grepping.
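The file naming and the marker line can be sketched like this. It is an illustration under assumptions, not the script’s actual code: the underscore separator, the .txt extension and the exact layout of the asterisk line are guesses, and the function names are hypothetical.

```python
import os
import re


def make_file_name(page_title, page_id):
    """Strip non-alphanumeric characters from the title and append the page ID."""
    safe_title = re.sub(r"[^A-Za-z0-9]", "", page_title)
    return "{}_{}.txt".format(safe_title, page_id)


def write_page_file(output_dir, page_title, page_id, url, content):
    """Write one page's source to its own file, with a URL marker on the first line."""
    path = os.path.join(output_dir, make_file_name(page_title, page_id))
    with open(path, "w", encoding="utf-8") as handle:
        # Asterisks make the URL line easy to pick out with grep.
        handle.write("**** {} ****\n".format(url))
        handle.write(content)
    return path
```

Appending the page ID matters because two different titles, such as ‘My page!’ and ‘My page?’, collapse to the same text once the punctuation is removed.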

Notes:

  • The script will show an error if the output directory already exists.
  • If you see the following error message, you need to enable the remote API (XML-RPC & SOAP) on your Confluence site: xmlrpc.client.ProtocolError: <ProtocolError for localhost:8090/rpc/xmlrpc: 403 Forbidden>
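If you adapt the script, you could catch that error and print a friendlier hint. Here is a minimal sketch of the idea. The function names are hypothetical, and the confluence2.login call is assumed from the Confluence XML-RPC remote API.

```python
import xmlrpc.client


def explain_protocol_error(error):
    """Turn an XML-RPC ProtocolError into a hint for the person running the script."""
    if error.errcode == 403:
        return ("403 Forbidden from {}: enable the remote API "
                "(XML-RPC & SOAP) on the Confluence site.".format(error.url))
    return "HTTP {} from {}: {}".format(error.errcode, error.url, error.errmsg)


def login(base_url, username, password):
    """Log in and return (server, token), or exit with a readable hint."""
    server = xmlrpc.client.ServerProxy(base_url.rstrip("/") + "/rpc/xmlrpc")
    try:
        return server, server.confluence2.login(username, password)
    except xmlrpc.client.ProtocolError as error:
        raise SystemExit(explain_protocol_error(error))
```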

Grep and grepWin

Now that you have the page content in text form, the world’s your oyster. 🙂 You can use the full power of text search tools. If you’re on UNIX, you’ll already know about grep.

If you’re on Windows, let me introduce grepWin. It’s a free, powerful search tool that you can install on Windows. It offers regular expression (regexp) searches as well as standard searches, and it has a very nice UI (user interface).

This screenshot shows a search for an image called ‘step-2-image-1-confluence-demo-space.png’. The image is attached to one page, and referenced in two pages. QED. 😀

grepWin
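If you would rather stay in Python than install a search tool, a few lines will do a basic grep-style search over the output directory. This is a rough sketch with a hypothetical function name. It assumes the files are UTF-8 text and it does not descend into subdirectories.

```python
import os
import re


def search_files(directory, pattern):
    """Return (file name, line number, line) tuples for lines matching the pattern."""
    regex = re.compile(pattern)
    hits = []
    # Sort the file names so that results come back in a predictable order.
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if regex.search(line):
                    hits.append((name, number, line.rstrip("\n")))
    return hits
```

When searching for a literal file name, wrap it in re.escape() so that the dots are not treated as regular expression wildcards.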

 

Comments welcome!

I’d love to know if you think you’ll find the script useful, and if you have any ideas for improving it.

How to find duplicate page names across Confluence spaces

Technical writers sometimes need to check for duplicate page names across different spaces in Confluence wiki. For example, our team is planning to put copies of all the product documentation (JIRA, Confluence, Bamboo, FishEye and Crucible) into the Atlassian OnDemand documentation space. But you can’t have two pages with the same name in the same space. So you need a way of finding duplicate page names.

Here’s another use case: Last week a customer emailed me saying that he has to merge 9 spaces, and he needs to find out whether he has any duplicate pages that would break the process.

I’ve written a couple of scripts to solve the problem. You’re welcome to use them, with the proviso that they’re not perfect.

Enter the pagelister and pageduptest Python scripts

The scripts are in a repository on Bitbucket: https://bitbucket.org/sarahmaddox/check-duplicate-pages.

  • pagelister.py – Accesses Confluence via the remote API, and lists all the pages in a given set of Confluence spaces. It puts the page names and space keys into a text file in the format required by the pageduptest script.
  • pageduptest.py – Checks a text file for duplicate page names, and spits out the offending page names and space keys.

More about pageduptest

The pageduptest script reads a text file containing Confluence space keys and page names, and reports on duplicate page names. The script assumes an input text file of a specific format.

To produce the input text file, you can do one of the following:

  • Option 1: Use a Children macro on a Confluence page, to list all the pages in your space. Copy the page names and paste them into a text file. This is handy for people who do not have access to the Confluence remote API.
  • Option 2: Use the pagelister.py script to list all the pages in a given set of Confluence spaces. It puts the page names and space keys into a text file in the format required by the pageduptest script.

Note: You don’t need access to Confluence, nor the Confluence remote API, to run the pageduptest script. But you do need it for the pagelister script.

Overview of the input file

The pageduptest script expects a file that contains a list of space keys and page names.

  • The space key is on a separate line at the start of each set of page names. The line for the space key starts with “Spacekey=”.
  • Each page name is on a separate line.

Example illustrating the format of the file:

Spacekey=DOC
This is the name of a page
This is the name of page BB
How to eat a chocolate
Spacekey=JIRA
This is the name of page BB
This is the name of page D
My page F
Spacekey=FISHEYE
This is the name of page BBB
Talking about pages
My page F
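Reading a file in this format takes only a few lines of Python. The following is a minimal sketch with a hypothetical function name, not the parsing code from pageduptest.py. It assumes blank lines can be skipped and that a page name appearing before any “Spacekey=” line is an error.

```python
def parse_page_list(text):
    """Parse 'Spacekey=' formatted text into {space key: [page names]}."""
    pages_by_space = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("Spacekey="):
            # A new space starts here; later lines belong to it.
            current = line[len("Spacekey="):]
            pages_by_space.setdefault(current, [])
        elif current is None:
            raise ValueError("page name found before any Spacekey= line: " + line)
        else:
            pages_by_space[current].append(line)
    return pages_by_space
```

With the example above, the result has three keys (DOC, JIRA and FISHEYE), each holding a list of three page names.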

Installing Python

To run the scripts, you need to install Python. The scripts are designed for Python 3, not Python 2. There were fairly significant changes in Python 3.

  1. Download Python 3.2.3 or later: http://www.python.org/getit/
    (I downloaded python-3.2.3.amd64.msi, because I’m working on a 64-bit Windows machine.)
  2. Run the installer to install Python on your computer.
    (I left all the options at their default values.)
  3. Add the location of your Python installation to your path variable in Windows:
    1. Go to ‘Start’ > ‘Control Panel’ > ‘System’ > ‘Advanced system settings’
    2. Click ‘Environment Variables’.
    3. In the ‘System variables’ section, select ‘Path’.
    4. Click ‘Edit’.
    5. Add the following to the end of the path, assuming that you installed Python in the default location:
      ;C:\Python32
    6. Click ‘OK’ three times.
    7. Open a command window and type ‘python’ to see if all is OK. You should see the Python version information, followed by the ‘>>>’ prompt.

Getting the scripts

Go to the Bitbucket repository and choose ‘get source’, then download the zip file and unpack it into a directory on your computer.

Running the pagelister script to build the list of pages to check

As mentioned before, you can either use a Children macro, or run the pagelister.py script, to produce the input text file required by the pageduptest script.

To use the pagelister script:

  1. Enable the remote API (XML-RPC & SOAP) on your Confluence site.
  2. Open the pagelister.py script in Python’s ‘IDLE’ GUI. (Right-click on the script and choose ‘Edit with IDLE’.)
  3. Run the script from within IDLE. (Press F5.)
  4. The Python shell will open and prompt you for some information:
    • Confluence URL – The base URL of your Confluence site. If the site uses SSL, enter ‘HTTPS’ instead of ‘HTTP’. For example: https://my.confluence.com
    • Username – Confluence will use this username to access the pages. This username must have ‘view’ access to all the spaces and pages that you want to check.
    • Password – The password for the above username.
    • Space keys – A comma-separated list of Confluence space keys. Do not put spaces around the commas. Case is not important – the match is not case-sensitive. For example: doc,choc,ds.
    • Output file name – The name of the text file where pagelister should put its results. If this file already exists, pagelister will overwrite the content.
  5. Look for the output file in the same directory as the pagelister.py script.

pagelister.py
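The text-handling half of a script like pagelister can be sketched as follows. This is not the code from the repository. The function names are hypothetical, and upper-casing the space keys is an assumption made for tidy output.

```python
def parse_space_keys(raw):
    """Split 'doc,choc,ds' into ['DOC', 'CHOC', 'DS'], ignoring empty entries."""
    return [key.strip().upper() for key in raw.split(",") if key.strip()]


def format_page_list(pages_by_space):
    """Render {space key: [page names]} in the Spacekey= layout."""
    lines = []
    for space_key, names in pages_by_space.items():
        lines.append("Spacekey=" + space_key)
        lines.extend(names)
    return "\n".join(lines) + "\n"
```

Writing the returned string to the output file, opened in ‘w’ mode, overwrites any existing content, which matches the behaviour described above.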

Running the pageduptest script to find duplicate page names

Now you can use the pageduptest script to find duplicate pages in your input text file:

  1. Open the pageduptest.py script in Python’s ‘IDLE’ GUI. (Right-click on the script and choose ‘Edit with IDLE’.)
  2. Run the script from within IDLE. (Press F5.)
  3. The Python shell will open, and prompt you for some information:
    • Input file name – The name of the text file that contains the space keys and pages to be checked.
    • Output file name – The name of the text file where pageduptest should put its results. If this file already exists, pageduptest will overwrite the content.
  4. Look for the output file in the same directory as the pageduptest.py script.

pageduptest.py

Notes:

  • To ensure that the duplicate test is case insensitive, the pageduptest script converts all page names to upper case before doing the comparison.
  • I haven’t explicitly tested the comparison with page names that contain numbers or special characters.
  • If you see the following error message when running the pagelister script, then you need to enable the remote API (XML-RPC & SOAP) on your Confluence site: “xmlrpc.client.ProtocolError: <ProtocolError for localhost:8090/rpc/xmlrpc: 403 Forbidden>”

Improvements to the scripts

The scripts are pretty much a hack that I’ve put together to solve a problem. I’m blogging about them because I reckon other people will find them useful.

There are things I could do to improve the scripts. 🙂 I’ve thought about adding a “superdupchecker” script that does everything all in one go: Get the pages from Confluence, check for duplicates, and then spit out the list of duplicate pages plus their URLs.

Let me know if you use the above scripts, and if you would find a “superdupchecker” useful!
