How to find duplicate page names across Confluence spaces

Technical writers sometimes need to check for duplicate page names across different spaces in a Confluence wiki. For example, our team is planning to put copies of all the product documentation (JIRA, Confluence, Bamboo, FishEye and Crucible) into the Atlassian OnDemand documentation space. But you can’t have two pages with the same name in the same space. So you need a way of finding duplicate page names before you combine the spaces.

Here’s another use case: Last week a customer emailed me saying that he has to merge 9 spaces, and he needs to find out whether he has any duplicate pages that would break the process.

I’ve written a couple of scripts to solve the problem. You’re welcome to use them, with the proviso that they’re not perfect.

Enter the pagelister and pageduptest Python scripts

The scripts are in a repository on Bitbucket: https://bitbucket.org/sarahmaddox/check-duplicate-pages.

  • pagelister.py – Accesses Confluence via the remote API, and lists all the pages in a given set of Confluence spaces. It puts the page names and space keys into a text file in the format required by the pageduptest script.
  • pageduptest.py – Checks a text file for duplicate page names, and spits out the offending page names and space keys.

More about pageduptest

The pageduptest script reads a text file containing Confluence space keys and page names, and reports on duplicate page names. The script assumes an input text file of a specific format.

To produce the input text file, you can do one of the following:

  • Option 1: Use a Children macro on a Confluence page, to list all the pages in your space. Copy the page names and paste them into a text file. This is handy for people who do not have access to the Confluence remote API.
  • Option 2: Use the pagelister.py script to list all the pages in a given set of Confluence spaces. It puts the page names and space keys into a text file in the format required by the pageduptest script.

Note: You don’t need access to Confluence, nor the Confluence remote API, to run the pageduptest script. But you do need it for the pagelister script.

Overview of the input file

The pageduptest script expects a file that contains a list of space keys and page names.

  • The space key is on a separate line at the start of each set of page names. The line for the space key starts with “Spacekey=“.
  • Each page name is on a separate line.

Example illustrating the format of the file:

Spacekey=DOC
This is the name of a page
This is the name of page BB
How to eat a chocolate
Spacekey=JIRA
This is the name of page BB
This is the name of page D
My page F
Spacekey=FISHEYE
This is the name of page BBB
Talking about pages
My page F
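
To make the format concrete, here’s a minimal sketch of how a file like this could be read into (space key, page name) pairs. It isn’t the code from pageduptest.py. The function name parse_page_list is made up for this illustration, and I’m assuming that blank lines and any page names appearing before the first “Spacekey=” line should simply be ignored.

```python
def parse_page_list(lines):
    """Return a list of (space_key, page_name) pairs from the input lines."""
    prefix = "Spacekey="
    pairs = []
    current_key = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no meaning
        if line.startswith(prefix):
            # A new space starts here; remember its key.
            current_key = line[len(prefix):].strip()
        elif current_key is not None:
            # Any other line is a page name in the current space.
            pairs.append((current_key, line))
    return pairs


sample = [
    "Spacekey=DOC",
    "This is the name of a page",
    "How to eat a chocolate",
    "Spacekey=JIRA",
    "My page F",
]

for space_key, page_name in parse_page_list(sample):
    print(space_key, "|", page_name)
```

To read a real file, you could pass an open file object instead of the sample list, because the function only needs something it can loop over line by line.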

Installing Python

To run the scripts, you need to install Python. The scripts are designed for Python 3, not Python 2. Python 3 introduced fairly significant changes to the language and its libraries, so the scripts may not run under Python 2.

  1. Download Python 3.2.3 or later: http://www.python.org/getit/
    (I downloaded python-3.2.3.amd64.msi, because I’m working on a 64-bit Windows machine.)
  2. Run the installer to install Python on your computer.
    (I left all the options at their default values.)
  3. Add the location of your Python installation to your path variable in Windows:
    1. Go to ‘Start’ > ‘Control Panel’ > ‘System’ > ‘Advanced system settings’
    2. Click ‘Environment Variables’.
    3. In the ‘System variables’ section, select ‘Path’.
    4. Click ‘Edit’.
    5. Add the following to the end of the path, assuming that you installed Python in the default location:
      ;C:\Python32
    6. Click ‘OK’ three times.
    7. Open a command window and type ‘python’ to see if all is OK. You should see the Python version information, followed by the ‘>>>’ prompt.

Getting the scripts

Go to the Bitbucket repository and choose ‘get source’, then download the zip file and unpack it into a directory on your computer.

Running the pagelister script to build the list of pages to check

As mentioned before, you can either use a Children macro, or run the pagelister.py script, to produce the input text file required by the pageduptest script.

To use the pagelister script:

  1. Enable the remote API (XML-RPC & SOAP) on your Confluence site.
  2. Open the pagelister.py script in Python’s ‘IDLE’ GUI.  (Right-click on the script and choose ‘Edit with IDLE’.)
  3. Run the script from within IDLE. (Press F5.)
  4. The Python shell will open and prompt you for some information:
    • Confluence URL – The base URL of your Confluence site. If the site uses SSL, enter ‘HTTPS’ instead of ‘HTTP’. For example: https://my.confluence.com
    • Username – Confluence will use this username to access the pages. This username must have ‘view’ access to all the spaces and pages that you want to check.
    • Password – The password for the above username.
    • Space keys – A comma-separated list of Confluence space keys. Do not put spaces before or after the commas. The match is not case-sensitive, so you can enter the keys in upper or lower case. For example: doc,choc,ds.
    • Output file name – The name of the text file where pagelister should put its results. If this file already exists, pagelister will overwrite the content.
  5. Look for the output file in the same directory as the pagelister.py script.
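
If you’re curious about what a script like this does under the covers, here’s a rough sketch. It is not the actual pagelister.py code. I’m assuming the Confluence XML-RPC endpoint at /rpc/xmlrpc and the confluence2 methods login and getPages, with getPages returning page summaries that have a ‘title’ field. The function names list_pages and main are made up for this illustration.

```python
import getpass
import xmlrpc.client


def list_pages(api, token, space_keys):
    """Build the 'Spacekey=' formatted lines for the given spaces.

    'api' is the XML-RPC namespace object, for example server.confluence2.
    """
    lines = []
    for key in space_keys:
        lines.append("Spacekey=" + key)
        # Assumption: getPages returns a list of dicts with a 'title' entry.
        for page in api.getPages(token, key):
            lines.append(page["title"])
    return lines


def main():
    base_url = input("Confluence URL: ").rstrip("/")
    username = input("Username: ")
    password = getpass.getpass("Password: ")
    keys = input("Space keys (comma-separated): ").split(",")
    out_name = input("Output file name: ")

    server = xmlrpc.client.ServerProxy(base_url + "/rpc/xmlrpc")
    token = server.confluence2.login(username, password)
    lines = list_pages(server.confluence2, token, keys)

    # Opening with "w" overwrites any existing file of the same name.
    with open(out_name, "w", encoding="utf-8") as out:
        out.write("\n".join(lines) + "\n")


# Call main() yourself to run this against a real Confluence site.
```

Keeping list_pages separate from the prompts means you can try it out with a stand-in object before pointing it at a live site.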

Running the pageduptest script to find duplicate page names

Now you can use the pageduptest script to find duplicate pages in your input text file:

  1. Open the pageduptest.py script in Python’s ‘IDLE’ GUI.  (Right-click on the script and choose ‘Edit with IDLE’.)
  2. Run the script from within IDLE. (Press F5.)
  3. The Python shell will open, and prompt you for some information:
    • Input file name – The name of the text file that contains the space keys and pages to be checked.
    • Output file name – The name of the text file where pageduptest should put its results. If this file already exists, pageduptest will overwrite the content.
  4. Look for the output file in the same directory as the pageduptest.py script.
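
The heart of a check like this is quite small. Here’s a minimal sketch, not the real pageduptest.py code, that groups space keys by upper-cased page name and keeps only the names that occur more than once. The function name find_duplicates and the sample data are made up for this illustration.

```python
from collections import defaultdict


def find_duplicates(pairs):
    """Map each duplicated page name (upper-cased) to the spaces that contain it."""
    spaces_by_name = defaultdict(list)
    for space_key, page_name in pairs:
        # Upper-casing makes the comparison case-insensitive.
        spaces_by_name[page_name.strip().upper()].append(space_key)
    # Keep only the names that appear more than once.
    return {name: keys for name, keys in spaces_by_name.items() if len(keys) > 1}


pairs = [
    ("DOC", "This is the name of page BB"),
    ("JIRA", "this is the name of page bb"),
    ("JIRA", "My page F"),
    ("FISHEYE", "My page F"),
    ("FISHEYE", "Talking about pages"),
]

for name, keys in find_duplicates(pairs).items():
    print(name, "->", ", ".join(keys))
```

A dictionary keyed on the normalised name means each page is looked at only once, which should keep things quick even for large spaces.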

Notes:

  • To ensure that the duplicate test is case insensitive, the pageduptest script converts all page names to upper case before doing the comparison.
  • I haven’t explicitly tested the comparison with page names that contain numbers or special characters.
  • If you see the following error message when running the pagelister script, then you need to enable the remote API (XML-RPC & SOAP) on your Confluence site: “xmlrpc.client.ProtocolError: <ProtocolError for localhost:8090/rpc/xmlrpc: 403 Forbidden>”
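
If you’re adapting the scripts, you could catch that error and print a friendlier hint. Here’s a sketch of one way to do it, assuming the same /rpc/xmlrpc endpoint and the confluence2 login method. The function names explain_protocol_error and login are made up for this illustration and are not part of pagelister.py.

```python
import xmlrpc.client


def explain_protocol_error(err):
    """Turn an XML-RPC ProtocolError into a more helpful message."""
    if err.errcode == 403:
        return ("403 Forbidden: check that the remote API (XML-RPC & SOAP) "
                "is enabled on your Confluence site.")
    return "HTTP error %d: %s" % (err.errcode, err.errmsg)


def login(base_url, username, password):
    """Log in and return a token, or exit with a readable message."""
    server = xmlrpc.client.ServerProxy(base_url.rstrip("/") + "/rpc/xmlrpc")
    try:
        return server.confluence2.login(username, password)
    except xmlrpc.client.ProtocolError as err:
        raise SystemExit(explain_protocol_error(err))
```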

Improvements to the scripts

The scripts are pretty much a hack that I’ve put together to solve a problem. I’m blogging about them because I reckon other people will find them useful.

There are things I could do to improve the scripts. :) I’ve thought about adding a “superdupchecker” script that does everything all in one go: Get the pages from Confluence, check for duplicates, and then spit out the list of duplicate pages plus their URLs.

Let me know if you use the above scripts, and if you would find a “superdupchecker” useful!

About Sarah Maddox

Technical writer, author and blogger in Sydney

Posted on 28 July 2012, in Confluence, technical writing, wiki. 3 Comments.

  1. Hi Sarah,

    I would like to use your script, but the TLS connect fails. According package dump is not transmitted session ID. Did you perhaps a solution for this?

    Bye Jürgen

    • Hallo Jürgen,
      I’m sorry, I don’t know the answer to that question. I think Python and SSL connections can be a bit tricky. I’m guessing you’re using an HTTPS connection. You could try searching StackOverflow or other forums to see if other people have encountered similar problems in a similar environment to yours. For example, this thread talks about SSL and Python under Ubuntu. That’s about the limit of my knowledge here. :)
      Cheers
      Sarah

  2. Thank you Sarah for your answer even if does not help me;) Greetings to Down Under
