Confluence full-text search using Python and grep
The standard search in Confluence wiki searches the visible content of the page. It also offers keywords for some specific searches, such as macro names and page titles. But sometimes we need to find things that the search cannot find, because the content of the relevant XML elements is not indexed. This post offers a solution of sorts: Copy the XML storage format of your pages into text files on your local machine, then use a powerful search like grep
to do the work.
Here are some examples of the problem:
- We may want to find all pages that reference a certain image, or other attachment. It’s easy enough to find the page(s) where the image is attached. But it’s not possible to find all pages that display a given image which is attached to another page.
- It’s possible to search for all occurrences of a macro name, using the
macroName:
keyword in the search. But it’s not possible to search for parameter values. This means, for example, you can’t search for all pages that include content from a given page.
I’ve written a script to solve the problem, by downloading the storage format from Confluence onto your local machine, where you can use all sorts of powerful text searches. You’re welcome to use the script, with the proviso that it’s not perfect.
Python script: getConfluencePageContent
The script is in a repository on Bitbucket: https://bitbucket.org/sarahmaddox/confluence-full-text-search.
Note: To run the script successfully, you need access to Confluence, and the Confluence remote API must be enabled.
Installing Python
To run the script, you need to install Python. The scripts are designed for Python 3, not Python 2. There were fairly significant changes in Python 3.
- Download Python 3.2.3 or later: http://www.python.org/getit/
(I downloadedpython-3.2.3.amd64.msi
, because I’m working on a 64-bit Windows machine.) - Run the installer to install Python on your computer.
(I left all the options at their default values.) - Add the location of your Python installation to your path variable in Windows:
- Go to ‘Start’ > ‘Control Panel’ > ‘System’ > ‘Advanced system settings’
- Click ‘Environment Variables’.
- In the ‘System variables’ section, select ‘Path’.
- Click ‘Edit’.
- Add the following to the end of the path, assuming that you installed Python in the default location:
;C:\Python32
- Click ‘OK’ three times.
- Open a command window and type ‘python’ to see if all is OK. You should see something like this:
Getting the script
Go to the Bitbucket repository and choose ‘Downloads’ > ‘Branches’, then download the zip file and unzip it into a directory on your computer.
Running the script to get the content of your pages
To use the getConfluencePageContent
script:
- Enable the remote API (XML-RPC & SOAP) on your Confluence site.
- Open the
getConfluencePageContent
script in Python’s ‘IDLE’ GUI. (Right-click on the script and choose ‘Edit with IDLE’.) - Run the script from within IDLE. (Press F5.)
- The Python shell will open and prompt you for some information:
- Confluence URL – The base URL of your Confluence site. If the site uses SSL, enter ‘HTTPS’ instead of ‘HTTP’. For example:
https://my.confluence.com
- Username – Confluence will use this username to access the pages. This username must have ‘view’ access to all the spaces and pages that you want to check.
- Password – The password for the above username.
- Space key – A Confluence space key. Case is not important – the match is not case-sensitive.
- Output directory name – The directory where the script should put its results. The script will create this directory. Make sure it does not yet exist.
- Confluence URL – The base URL of your Confluence site. If the site uses SSL, enter ‘HTTPS’ instead of ‘HTTP’. For example:
- Look for the output directory as a sibling of the directory that contains the
getConfluencePageContent
script. In other words, the output directory will appear in your file system at the same level as the script’s directory.
Output of the script
The Bitbucket repository contains an example of the output, based on the Demonstration space shipped with Confluence. See the outputexample
directory in the repository. For example, this file contains the content of the page titled ‘Welcome to Confluence’.
The script gets the content of all pages in the given Confluence space. It puts the content of each page into a separate text file, in a given directory.
The script creates the output directory as a sibling of the directory that contains the getConfluencePageContent
script. In other words, the output directory will appear in your file system at the same level as the script’s directory.
The file name is a combination of the page name and page ID. To prevent problems when creating the files, the script removes all non-alphanumeric characters from the file name. To ensure uniqueness, it appends the page ID to the page name when creating the file name.
The content is in the form of the Confluence storage format, which is a type of XML consisting of HTML with Confluence-specific elements. (Docs.)
The script also writes a line at the top of each file, containing the URL of the page, and marked with asterisks for easy grepping.
Notes:
- The script will show an error if the output directory already exists.
- If you see the following error message, you need to enable the remote API (XML-RPC & SOAP) on your Confluence site: xmlrpc.client.ProtocolError: <ProtocolError for localhost:8090/rpc/xmlrpc: 403 Forbidden>
Grep and winGrep
Now that you have the page content in text form, the world’s your oyster. 🙂 You can use the full power of text search tools. If you’re on UNIX, you’ll already know about grep.
If you’re on Windows, let me introduce grepWin. It’s a free, powerful search tool that you can install on Windows. It offers regular expression (regexp) searches as well as standard searches, and it has a very nice UI (user interface).
This screenshot shows a search for an image called ‘step-2-image-1-confluence-demo-space.png’. The image is attached to one page, and referenced in two pages. QED. 😀
Comments welcome!
I’d love to know if you think you’ll find the script useful, and if you have any ideas for improving it.
Posted on 20 April 2013, in Confluence, technical writing, wiki and tagged Confluence, grep, grepWin, Python, search, technical communication, technical writing, wiki. Bookmark the permalink. 3 Comments.
Thank you for your research on this topic and for sharing your knowledge. I’ve been able to do so much more in Confluence with scripts (and so much faster!). Python is a great tool, and I’m glad to have stumbled upon your article. Thanks again!
Hallo Kaitlyn
You made my day. 🙂 I’m so glad the scripts are helpful. I found that knowing how to interact with Confluence via scripts really made the wiki fly. Even when these scripts get out of date, at least people will see what’s possible and that gives them licence to fly too!
Thanks so much for commenting!
Cheers
Sarahn
Pingback: How to manage attachment usage in Confluence wiki with some Python scripts | ffeathers