Blog Archives
Confluence full-text search using Python and grep
The standard search in Confluence wiki searches the visible content of the page. It also offers keywords for some specific searches, such as macro names and page titles. But sometimes we need to find things that the search cannot find, because the content of the relevant XML elements is not indexed. This post offers a solution of sorts: Copy the XML storage format of your pages into text files on your local machine, then use a powerful search like grep to do the work.
Here are some examples of the problem:
- We may want to find all pages that reference a certain image, or other attachment. It’s easy enough to find the page(s) where the image is attached. But it’s not possible to find all pages that display a given image which is attached to another page.
- It’s possible to search for all occurrences of a macro name, using the macroName: keyword in the search. But it’s not possible to search for parameter values. This means, for example, you can’t search for all pages that include content from a given page.
I’ve written a script to solve the problem, by downloading the storage format from Confluence onto your local machine, where you can use all sorts of powerful text searches. You’re welcome to use the script, with the proviso that it’s not perfect.
Python script: getConfluencePageContent
The script is in a repository on Bitbucket: https://bitbucket.org/sarahmaddox/confluence-full-text-search.
Note: To run the script successfully, you need access to Confluence, and the Confluence remote API must be enabled.
Installing Python
To run the script, you need to install Python. The scripts are designed for Python 3, not Python 2. There were fairly significant changes in Python 3.
- Download Python 3.2.3 or later: http://www.python.org/getit/
(I downloaded python-3.2.3.amd64.msi, because I’m working on a 64-bit Windows machine.)
- Run the installer to install Python on your computer.
(I left all the options at their default values.)
- Add the location of your Python installation to your path variable in Windows:
- Go to ‘Start’ > ‘Control Panel’ > ‘System’ > ‘Advanced system settings’
- Click ‘Environment Variables’.
- In the ‘System variables’ section, select ‘Path’.
- Click ‘Edit’.
- Add the following to the end of the path, assuming that you installed Python in the default location:
;C:\Python32
- Click ‘OK’ three times.
- Open a command window and type ‘python’ to see if all is OK. You should see the Python version banner, followed by the >>> prompt.
Getting the script
Go to the Bitbucket repository and choose ‘Downloads’ > ‘Branches’, then download the zip file and unzip it into a directory on your computer.
Running the script to get the content of your pages
To use the getConfluencePageContent script:
- Enable the remote API (XML-RPC & SOAP) on your Confluence site.
- Open the getConfluencePageContent script in Python’s ‘IDLE’ GUI. (Right-click on the script and choose ‘Edit with IDLE’.)
- Run the script from within IDLE. (Press F5.)
- The Python shell will open and prompt you for some information:
- Confluence URL – The base URL of your Confluence site. If the site uses SSL, enter ‘HTTPS’ instead of ‘HTTP’. For example:
https://my.confluence.com
- Username – Confluence will use this username to access the pages. This username must have ‘view’ access to all the spaces and pages that you want to check.
- Password – The password for the above username.
- Space key – A Confluence space key. Case is not important – the match is not case-sensitive.
- Output directory name – The directory where the script should put its results. The script will create this directory. Make sure it does not yet exist.
- Look for the output directory as a sibling of the directory that contains the getConfluencePageContent script. In other words, the output directory will appear in your file system at the same level as the script’s directory.
Output of the script
The Bitbucket repository contains an example of the output, based on the Demonstration space shipped with Confluence. See the outputexample directory in the repository. For example, this file contains the content of the page titled ‘Welcome to Confluence’.
The script gets the content of all pages in the given Confluence space. It puts the content of each page into a separate text file, in a given directory.
The script creates the output directory as a sibling of the directory that contains the getConfluencePageContent script. In other words, the output directory will appear in your file system at the same level as the script’s directory.
The file name is a combination of the page name and page ID. To prevent problems when creating the files, the script removes all non-alphanumeric characters from the file name. To ensure uniqueness, it appends the page ID to the page name when creating the file name.
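The file-naming rule described above could be sketched roughly like this. It's a minimal illustration, not the code from the repository: the function name make_file_name, the .txt extension and the restriction to ASCII letters and digits are all assumptions made for the example.

```python
import re


def make_file_name(page_title, page_id):
    """Build a file name from a page title and page ID (hypothetical helper)."""
    # Keep only ASCII letters and digits, so the name is safe on any file system.
    safe_title = re.sub(r"[^A-Za-z0-9]", "", page_title)
    # Append the page ID so that two pages with similar titles cannot collide.
    return "{0}{1}.txt".format(safe_title, page_id)


print(make_file_name("Welcome to Confluence!", "98319"))
# prints WelcometoConfluence98319.txt
```

The page ID in the example is made up. Appending it matters because stripping punctuation can turn two different titles into the same string.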
The content is in the form of the Confluence storage format, which is a type of XML consisting of HTML with Confluence-specific elements. (Docs.)
The script also writes a line at the top of each file, containing the URL of the page, and marked with asterisks for easy grepping.
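To give a feel for what a script like this does, here is a rough sketch of fetching pages over XML-RPC and writing them out. It is not the code from the repository. It assumes the confluence2 methods of the Confluence remote API (login, getPages, getPage) at the /rpc/xmlrpc endpoint, and the function names, the header format with four asterisks, and the use of the bare page ID as the file name are all made up for the example.

```python
import os
import xmlrpc.client


def connect(base_url, username, password):
    """Log in via XML-RPC and return the server proxy plus a session token."""
    server = xmlrpc.client.ServerProxy(base_url.rstrip("/") + "/rpc/xmlrpc")
    token = server.confluence2.login(username, password)
    return server, token


def save_space_pages(server, token, space_key, output_dir):
    """Write the storage format of every page in a space to text files."""
    # os.makedirs raises an error if the directory already exists.
    os.makedirs(output_dir)
    for summary in server.confluence2.getPages(token, space_key):
        page = server.confluence2.getPage(token, summary["id"])
        # Simplified naming for this sketch: the page ID only.
        file_path = os.path.join(output_dir, "{0}.txt".format(page["id"]))
        with open(file_path, "w", encoding="utf-8") as out_file:
            # Assumed header format: the page URL wrapped in asterisks.
            out_file.write("**** {0} ****\n".format(page["url"]))
            out_file.write(page["content"])
```

You would call connect() with your Confluence URL, username and password, then pass the returned server and token to save_space_pages() along with a space key and the output directory.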
Notes:
- The script will show an error if the output directory already exists.
- If you see the following error message, you need to enable the remote API (XML-RPC & SOAP) on your Confluence site: xmlrpc.client.ProtocolError: <ProtocolError for localhost:8090/rpc/xmlrpc: 403 Forbidden>
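If you adapt the script for your own use, one way to make the 403 error friendlier is to catch it and print a hint. This is only a sketch: the helper name explain_protocol_error and the wording of the messages are invented for the example, while xmlrpc.client.ProtocolError and its errcode and errmsg attributes come from the Python standard library.

```python
import xmlrpc.client


def explain_protocol_error(error):
    """Return a friendlier message for an XML-RPC ProtocolError (hypothetical helper)."""
    if error.errcode == 403:
        return ("Confluence refused the request (403 Forbidden). "
                "Check that the remote API (XML-RPC & SOAP) is enabled.")
    return "Unexpected HTTP error {0}: {1}".format(error.errcode, error.errmsg)
```

Wrap the remote calls in try/except xmlrpc.client.ProtocolError and print the returned message instead of the raw traceback.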
Grep and grepWin
Now that you have the page content in text form, the world’s your oyster. 🙂 You can use the full power of text search tools. If you’re on UNIX, you’ll already know about grep.
If you’re on Windows, let me introduce grepWin. It’s a free, powerful search tool that you can install on Windows. It offers regular expression (regexp) searches as well as standard searches, and it has a very nice UI (user interface).
This screenshot shows a search for an image called ‘step-2-image-1-confluence-demo-space.png’. The image is attached to one page, and referenced in two pages. QED. 😀
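If you'd rather stay in Python than install a separate tool, a grep-like search over the output files might look something like this. It's a minimal sketch under stated assumptions: the function name find_matches is hypothetical, and the files are assumed to be UTF-8 text sitting directly in one directory.

```python
import os
import re


def find_matches(directory, pattern):
    """Yield (file name, line number, line) for each line matching the regex."""
    regex = re.compile(pattern)
    for file_name in sorted(os.listdir(directory)):
        file_path = os.path.join(directory, file_name)
        if not os.path.isfile(file_path):
            continue
        # errors="replace" keeps the search going if a file has odd bytes in it.
        with open(file_path, encoding="utf-8", errors="replace") as text_file:
            for line_number, line in enumerate(text_file, start=1):
                if regex.search(line):
                    yield file_name, line_number, line.rstrip("\n")
```

To repeat the image search from the screenshot, you could loop over find_matches(your_output_directory, re.escape("step-2-image-1-confluence-demo-space.png")) and print each hit. The re.escape call stops the dots in the file name from being treated as regex wildcards.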
Comments welcome!
I’d love to know if you think you’ll find the script useful, and if you have any ideas for improving it.
Want an XML schema viewer in Confluence wiki?
You got it. 🙂 Avisi have developed two nifty macros to display an XML schema (XSD) in tabular and graphic format on a Confluence page. The XSD Viewer is a new add-on for Confluence wiki, and the Avisi developers are keen for input from technical writers and others interested in XML schemas.
I’ve been playing around with the add-on, so I’d love to show you a couple of examples and tell you how to get it working for yourself. I’ve also chatted with Yanne from Avisi, who says that he and his team would love to have your feedback.
Example 1: A purchase order schema
I’ve grabbed the sample schema for a purchase order from MSDN: http://msdn.microsoft.com/en-us/library/ms256129.aspx. I’ve instructed the XSD viewer to start with the purchaseOrder element, and show a depth of 2 levels.
Example 2: Graham Hannington’s schema for the Confluence storage format
Hehe, if you put Confluence and XSD in the same blog post, then ‘twould be remiss not to include Graham’s XML schema for the Confluence storage format. 😀
The XSD Viewer is using confluence.xsd, starting with the image element.
One point of interest here is that the confluence.xsd file references two other schema files: confluence-ri.xsd and confluence-xhtml.xsd. All I had to do to make this work was to attach all three XSD files to the page. This screenshot shows the attachments on the above page:
Hiccups
A couple of times, the XSD Viewer has declined to show any rows in the table. I’m not sure why this occurs. If it happens to you too, it’s worth letting the Avisi team know.
My environment
I’m using Confluence 5.0.1, with version 1.1.1 of the XSD Viewer. I’m running Confluence on my Windows 7 laptop, and I’m using Chrome to view the wiki pages.
How to get your own XSD viewer
To make this happen, you need to do the following:
- Download and install Confluence, if you don’t already have it. You can try it for free for 30 days. See the Confluence download page.
- Download the XSD Viewer add-on and install it into Confluence. The add-on is also available for free for 30 days. See the XSD Viewer page on the Atlassian Marketplace.
- Create a page in Confluence.
- Attach your XSD file to the Confluence page, just as you would attach a screenshot or other file. See the documentation on adding attachments.
- Edit the page.
- Add the “XSD Image” and/or the “XSD Table” macros to the page. See the documentation for the XSD Viewer.
- Save the page.
Resources
Useful links:
- The XSD Viewer page on the Atlassian Marketplace.
- The documentation for the XSD Viewer.
- A getting-started video for the XSD Viewer on YouTube.
- The issue tracker for the XSD Viewer.
Feedback so far
I’ve given Yanne at Avisi some feedback already:
- At first the error messages were a bit too generic to be useful. Avisi have already followed up on this in the latest version of the add-on, which gives more specific error messages. Great!
- Currently the macro autocomplete in Confluence is triggered by “XSD”. Suggestion: Add “schema” and “XML” to the list of triggers.
- Add the option to add a border and other styling to the image.
The Avisi team like the latter two suggestions, and are waiting for more feedback before implementing them. Would you be interested in an XSD viewer in Confluence, and what requirements would you have for it?
Change management helping customers train their staff
For our recent major product release, we published a change management guide as a quick tool for customers to see what’s changed in the product. The document is called Planning for Confluence 5. This post describes how we went about designing the change management guide, and whether it’s working. I hope this information may be useful to other technical writers and change managers.
This week our development team released a major version of the product, Confluence 5.0, with many changes that will affect the way people work. We wanted to make sure administrators know they will probably need to warn their colleagues before upgrading the wiki. We also wanted to give a good overview of the changes that will affect people’s day-to-day activities.
Defining the audience
The release notes do a good job of describing what’s new in the release. That information is primarily useful to people who are looking to purchase the product for the first time, or renew their existing licences. The upgrade notes tell administrators what they need to know before, during and after the upgrade. For this release, we needed a way of telling everybody else what was coming. Enter the change management guide.
Many organisations, especially the larger ones, develop their own procedures and training material for their staff, incorporating guidelines on using our product. The change management guide is also intended to help the people who produce those guides, as it’s a good indication of the areas of major change.
Are people finding and reading the document?
I’ve put a link on the documentation home page as well as in the upgrade notes. I’ve moved the document to the top of the left-hand navigation panel, so that it hits that sweet spot where people look first and most often. I’ve tweeted about the document, and mentioned it on Google+.
Are people finding it, and are they sticking around long enough to read it?
This report from Google Analytics shows the traffic on the page.
Key dates:
- The Google Analytics report covers the period from 10th to 28th February.
- The software release date was 26th February.
- I published the document on 18th February, so that early adopters could see it. (Views before 18th February were therefore all by Atlassian employees.)
- The spike shows the highest number of hits on the date of release (26th to 27th February, in different time zones). The highest number of hits per day is on 27th February, at 486 views.
Interpreting the figures:
- People have looked at the page 2,196 times over the 17-day period.
- The number of unique visits to the page is 1,760 over the same period. So yes, people are finding the page.
- People are spending more than 3 minutes on the page, on average. That would seem to mean they’re reading it, rather than bouncing straight out.
- There’s a 57% bounce rate. In other words, people are reading this page and then leaving, without visiting other parts of the documentation site. That’s a fairly high bounce rate. For this page, I think that’s a good thing. The target audience is people who already know what the product does. They don’t need to read the rest of the documentation. They just need to know what’s going to change.
Comparing this page to the Confluence documentation home page:
The home page is the most popular page in the Confluence documentation (apart from an anomaly: a page that tells people how to configure their Java home variable in Windows, which is not specific to our product).
Over the same period of time, the report for the documentation home page shows:
- 12,140 page views
- 9,431 unique page views
- 1.12 minutes average time spent on the page
- 27.8% bounce rate
My interpretation: The new change management page has a surprisingly large number of views in relation to the most popular page. (The number of views of the new page is about 18% that of the home page.) People are spending more time on the change management page than on the home page. People use the home page as a place to click through to other parts of the documentation (hence the lower bounce rate) but they leave the documentation site satisfied, immediately after reading the change management page. Yesssss! 🙂
Hehe of course, you could probably interpret the figures differently.
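If you'd like to check the 18% figure yourself, here's the arithmetic as a tiny Python sketch. The variable names are made up for the example, and the numbers come from the two reports above. Treating the “unique visits” figure for the new page as comparable to the “unique page views” figure for the home page is an assumption.

```python
change_guide_views = 2196
home_page_views = 12140
change_guide_unique = 1760
home_page_unique = 9431

# Ratio of the change management page to the documentation home page.
views_ratio = change_guide_views / home_page_views
unique_ratio = change_guide_unique / home_page_unique

print("Page views: {0:.1%} of the home page".format(views_ratio))
print("Unique views: {0:.1%} of the home page".format(unique_ratio))
```

That works out to roughly 18.1% for page views and 18.7% for unique views.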
Designing the content of the change management guide
We chose to design the content around the concepts of before and after, which we termed “previously” and “now”. This is the primary difference between this document and the release notes. The focus of this document is on what’s changed, not on what’s new. In fact, I’d go as far as to say that if there is a new feature in the release that won’t disturb the way people work, it doesn’t need to be part of this guide.
This screenshot (it feels weird taking a screenshot of a document!) shows a couple of the “previously” and “now” images we’ve used to illustrate the changes in functionality.
For the full effect, take a look at the document itself. You can click the images to zoom in: Planning for Confluence 5.
Easy to scan
We want people to find the information they need quickly. The page includes a number of pictures as well as words. Most of the pictures are annotated screenshots. The text is important too: People will search the page (Ctrl+F) to find specific topics.
The page has consistency of terminology and structure. There’s a short introduction, telling people the purpose of the document. There are links to related information, in case someone has wandered in off the Internet and landed on the wrong page (“every page is page one”). The table of contents on the right tells people at a glance what’s on the page.
Plenty of headings mean people can scroll down and absorb the content easily. The structural design of putting the “previously” sections on the left, and the “now” sections on the right, makes it easy to grok what’s happening.
Aesthetics
The page looks good. We want readers to have a pleasurable experience of the document, leading into a pleasurable experience of the product.
Feedback
Unsurprisingly, given the way life goes on a wiki, the page has turned out to be a handy place for people to comment on the release, and for the product managers to respond. Docs alive! 😀
New style for Confluence documentation – what do you think?
To mark the impending release of Confluence 5.0, we’ve applied a new style to the Confluence documentation. It’s done by means of some snazzy CSS, created by Andrew Prentice, Valter Fatia and Paul Watson.
What do you think of the new look? We’d love your feedback on the styles, the way some information is hidden until you hover over it (try zooming your cursor around the page to find the hidden bits) and the contrast with the standard Confluence 5.0 look.
Our customised styles
We’ve applied custom CSS only to the latest documentation space on the wiki – that’s the documentation for Confluence 5.0. This space is using the Documentation theme, but with a lot of CSS on top:
The standard Confluence 5.0 styles in the Documentation theme
The documentation for Confluence 4.3 is on the same wiki, but we haven’t applied the custom stylesheets. The wiki is already running Confluence 5.0, so you can see the new 5.0 look and feel without any custom styling. The space is using the Documentation theme:
The standard Confluence 5.0 styles in the default theme
Just for completeness, here’s the Atlassian Training space, on the same wiki, but using the default Confluence 5.0 theme:
How do you add CSS to a Confluence space?
It’s all in the documentation: Styling Confluence with CSS.
Thoughts? 🙂