Blog Archives

Confluence full-text search using Python and grep

The standard search in Confluence wiki searches the visible content of the page. It also offers keywords for some specific searches, such as macro names and page titles. But sometimes we need to find things that the search cannot find, because the content of the relevant XML elements is not indexed. This post offers a solution of sorts: Copy the XML storage format of your pages into text files on your local machine, then use a powerful search like grep to do the work.

Here are some examples of the problem:

  • We may want to find all pages that reference a certain image, or other attachment. It’s easy enough to find the page(s) where the image is attached. But it’s not possible to find all pages that display a given image which is attached to another page.
  • It’s possible to search for all occurrences of a macro name, using the macroName: keyword in the search. But it’s not possible to search for parameter values. This means, for example, you can’t search for all pages that include content from a given page.

I’ve written a script to solve the problem, by downloading the storage format from Confluence onto your local machine, where you can use all sorts of powerful text searches. You’re welcome to use the script, with the proviso that it’s not perfect.

Python script: getConfluencePageContent

The script is in a repository on Bitbucket: https://bitbucket.org/sarahmaddox/confluence-full-text-search.

Note: To run the script successfully, you need access to Confluence, and the Confluence remote API must be enabled.

Installing Python

To run the script, you need to install Python. The scripts are designed for Python 3, not Python 2. There were fairly significant changes in Python 3.

  1. Download Python 3.2.3 or later: http://www.python.org/getit/
    (I downloaded python-3.2.3.amd64.msi, because I’m working on a 64-bit Windows machine.)
  2. Run the installer to install Python on your computer.
    (I left all the options at their default values.)
  3. Add the location of your Python installation to your path variable in Windows:
    1. Go to ‘Start’ > ‘Control Panel’ > ‘System’ > ‘Advanced system settings’
    2. Click ‘Environment Variables’.
    3. In the ‘System variables’ section, select ‘Path’.
    4. Click ‘Edit’.
    5. Add the following to the end of the path, assuming that you installed Python in the default location:
      ;C:\Python32
    6. Click ‘OK’ three times.
    7. Open a command window and type ‘python’ to see if all is OK. You should see the Python version banner, followed by the ‘>>>’ prompt.

Getting the script

Go to the Bitbucket repository and choose ‘Downloads’ > ‘Branches’, then download the zip file and unzip it into a directory on your computer.

Running the script to get the content of your pages

To use the getConfluencePageContent script:

  1. Enable the remote API (XML-RPC & SOAP) on your Confluence site.
  2. Open the getConfluencePageContent script in Python’s ‘IDLE’ GUI. (Right-click on the script and choose ‘Edit with IDLE’.)
  3. Run the script from within IDLE. (Press F5.)
  4. The Python shell will open and prompt you for some information:
    • Confluence URL – The base URL of your Confluence site. If the site uses SSL, enter ‘HTTPS’ instead of ‘HTTP’. For example: https://my.confluence.com
    • Username – Confluence will use this username to access the pages. This username must have ‘view’ access to all the spaces and pages that you want to check.
    • Password – The password for the above username.
    • Space key – A Confluence space key. Case is not important – the match is not case-sensitive.
    • Output directory name – The directory where the script should put its results. The script will create this directory. Make sure it does not yet exist.
  5. Look for the output directory as a sibling of the directory that contains the getConfluencePageContent script. In other words, the output directory will appear in your file system at the same level as the script’s directory.

Python shell (IDLE)

 

Output of the script

The Bitbucket repository contains an example of the output, based on the Demonstration space shipped with Confluence. See the outputexample directory in the repository. For example, this file contains the content of the page titled ‘Welcome to Confluence’.

The script gets the content of all pages in the given Confluence space. It puts the content of each page into a separate text file, in a given directory.

The script creates the output directory as a sibling of the directory that contains the getConfluencePageContent script. In other words, the output directory will appear in your file system at the same level as the script’s directory.

The file name is a combination of the page name and page ID. To prevent problems when creating the files, the script removes all non-alphanumeric characters from the file name. To ensure uniqueness, it appends the page ID to the page name when creating the file name.

The content is in the form of the Confluence storage format, which is a type of XML consisting of HTML with Confluence-specific elements. (Docs.)

The script also writes a line at the top of each file, containing the URL of the page, and marked with asterisks for easy grepping.
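Here's a minimal sketch of how the file naming and the URL line could work. It is not the code from the repository. The function names, the '.txt' extension and the exact layout of the asterisk line are my assumptions for illustration.

```python
import os
import re


def make_file_name(title, page_id):
    """Build a file name from the page title and the page ID.

    Assumptions: only ASCII letters and digits survive, and the
    file gets a '.txt' extension. The real script may differ.
    """
    safe_title = re.sub(r"[^A-Za-z0-9]", "", title)
    return "{}{}.txt".format(safe_title, page_id)


def write_page_file(output_dir, title, page_id, url, content):
    """Write one page to its own file, with the page URL on the first line."""
    path = os.path.join(output_dir, make_file_name(title, page_id))
    with open(path, "w", encoding="utf-8") as f:
        # Asterisks make the URL line easy to grep for (assumed layout).
        f.write("**** {} ****\n".format(url))
        f.write(content)
    return path
```

Appending the page ID means that two pages whose titles differ only in punctuation still end up in different files.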

Notes:

  • The script will show an error if the output directory already exists.
  • If you see the following error message, you need to enable the remote API (XML-RPC & SOAP) on your Confluence site: xmlrpc.client.ProtocolError: <ProtocolError for localhost:8090/rpc/xmlrpc: 403 Forbidden>
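For the curious, here's a rough sketch of the kind of XML-RPC calls involved. It assumes the 'confluence2' remote API methods login, getPages and getPage, and the /rpc/xmlrpc path that appears in the error message above. The helper names (connect, fetch_space_pages, explain_protocol_error) are made up for this example and are not taken from the script in the repository.

```python
import xmlrpc.client


def connect(base_url):
    """Return an XML-RPC proxy for a Confluence site. No request is sent yet."""
    return xmlrpc.client.ServerProxy(base_url.rstrip("/") + "/rpc/xmlrpc")


def fetch_space_pages(server, token, space_key):
    """Yield (title, page ID, URL, storage-format content) for each page in a space."""
    # getPages returns page summaries; getPage returns the full page.
    for summary in server.confluence2.getPages(token, space_key):
        page = server.confluence2.getPage(token, summary["id"])
        yield page["title"], page["id"], page["url"], page["content"]


def explain_protocol_error(err):
    """Turn an xmlrpc.client.ProtocolError into a hint for the person running the script."""
    if err.errcode == 403:
        return "403 Forbidden: enable the remote API (XML-RPC & SOAP) in Confluence."
    return "HTTP error {}: {}".format(err.errcode, err.errmsg)
```

You would get the token by calling server.confluence2.login(username, password) on the proxy returned by connect, and wrap the calls in a try/except for xmlrpc.client.ProtocolError so that a 403 produces the hint instead of a stack trace.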

Grep and grepWin

Now that you have the page content in text form, the world’s your oyster. 🙂 You can use the full power of text search tools. If you’re on UNIX, you’ll already know about grep.

If you’re on Windows, let me introduce grepWin. It’s a free, powerful search tool that you can install on Windows. It offers regular expression (regexp) searches as well as standard searches, and it has a very nice UI (user interface).

This screenshot shows a search for an image called ‘step-2-image-1-confluence-demo-space.png’. The image is attached to one page, and referenced in two pages. QED. 😀

grepWin
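If you'd rather stay in Python than install a grep tool, a search like the one in the screenshot can be sketched in a few lines. This is only an illustration. It assumes that every file in the output directory starts with the URL line described earlier, and the function name search_pages is my own.

```python
import os


def search_pages(output_dir, needle):
    """Return (file name, URL line) for each page file that contains needle."""
    hits = []
    for name in sorted(os.listdir(output_dir)):
        path = os.path.join(output_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8", errors="replace") as f:
            lines = f.read().splitlines()
        # Skip the first line: it holds the page URL, not page content.
        if any(needle.lower() in line.lower() for line in lines[1:]):
            hits.append((name, lines[0]))
    return hits
```

Calling search_pages with the output directory and the image file name gives you a list of the matching files, each with the URL line of the page it came from.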

 

Comments welcome!

I’d love to know if you think you’ll find the script useful, and if you have any ideas for improving it.

How to search for macros and macro parameters in Confluence 4

Do you need to find all pages that use a given macro with given parameters and parameter values? For example, do you need to find all pages that use the Excerpt Include macro to include content from a given page? Or all pages that display the children of another page? Or all pages that display a given PDF file?

In a standard Confluence 4 installation, that is not possible because the macro parameters are not indexed. There is a new plugin available that makes the search possible: the Confluence Macro Indexer plugin.

A quick note before we start

Using the standard Confluence functionality, you can already search for occurrences of a given macro (but not its parameters). Enter the following in the Confluence search box, assuming that your macro name is “x”:

macroName: x*

You don’t need a plugin to use the above syntax. For details, see my previous post: How to search Confluence for usage of a macro

What’s new

The Confluence Macro Indexer plugin enhances the Confluence search, so that you can search for specific macros, macro parameters and/or parameter values. With the plugin installed, your Confluence site will allow a new search field name that you can enter into the Confluence search box, followed by a colon, like this:

wikiMarkup:

The plugin is available for Confluence 4.0 and later. I’m using Confluence 4.3.

How to use the macro search

Before you can use the macro search, you or your administrator must install the plugin onto your Confluence site. Installation instructions are at the end of this post.

To use the search:

  1. Enter 'wikiMarkup:' followed by the macro name and/or parameter, into the Confluence search box at top right of the Confluence screen, or on the standard Confluence search screen.
    For example: To find all occurrences of the Include macro that include a page called ‘Introduction to chocolate’:

    wikiMarkup:"include:Introduction to chocolate"

    There are more examples below.

  2. Press Enter or choose Search. The search results will appear as usual.

Notes:

  • After installing the plugin (see instructions below), you will need to reindex Confluence to ensure that all macros on existing pages are added to the index. See the guide to administering the Confluence content index.
  • The field name is case sensitive. You must enter ‘wikiMarkup:’, not ‘wikimarkup:’ or any other combination of case.
  • The macro names, parameter names and parameter values are not case sensitive.
  • You will need to include double quotation marks around the text after the field name, if the text includes special characters such as a colon. For example, enter the following code to find all occurrences of the Include macro that include a page called ‘Introduction’:
    wikiMarkup:"include:Introduction"
  • To use the search, you need to know the wiki markup for the macro name, the parameter names, and the accepted parameter values. The Confluence 4.x documentation contains such information for some of the macros. See https://confluence.atlassian.com/display/DOC/Confluence+Wiki+Markup+for+Macros. For the other macros, please refer to the Confluence 3.5 documentation: http://confluence.atlassian.com/display/CONF35/Working+with+Macros.
  • The plugin enhances the search only for macros and macro parameters – not for URLs in links or other such items.
  • The search works for user macros too.
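If you build these queries in a script, the quoting rule from the notes above is easy to get wrong. Here's a small sketch of a helper. The function name wikimarkup_query is hypothetical, and the rule it applies (quote anything that isn't purely letters, digits, hyphens or underscores) is my own conservative assumption, since the notes only mention special characters such as the colon.

```python
import re


def wikimarkup_query(text):
    """Build a Confluence search-box query for the plugin's wikiMarkup field."""
    # Plain macro names such as 'excerpt-include' need no quotation marks.
    if re.fullmatch(r"[A-Za-z0-9_-]+", text):
        return "wikiMarkup:" + text
    # Anything with a colon, space, equals sign and so on gets double quotes.
    return 'wikiMarkup:"{}"'.format(text)
```

The field name is written with the exact case that the plugin expects. The sketch does not handle text that itself contains double quotation marks.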

Examples

Find all pages that use a given macro

To find all pages that use the Excerpt Include macro:

wikiMarkup:excerpt-include

It works with quotation marks too:

wikiMarkup:"excerpt-include"

Find all pages that include an excerpt from a given page

To find all occurrences of the Excerpt Include macro that include content from a page called ‘Introduction to chocolate’:

wikiMarkup:"excerpt-include:Introduction to chocolate"

Find all pages that include a given page

To find all occurrences of the Include macro that include a page called ‘Introduction to chocolate’:

wikiMarkup:"include:Introduction to chocolate"

Note: In fact, the above search will find the occurrences of the Include macro as well as the Excerpt Include macro.

Find all pages that include any page with a name starting with a given string

To find all occurrences of the Include macro and the Excerpt Include macro that reference any page that has a name starting with ‘Introduction’:

wikiMarkup:"include:Introduction"

Find all pages that display a given PDF file using the View PDF macro

To find all occurrences of the View PDF macro that display the file ‘ChocRecipe.PDF’:

wikiMarkup:"viewpdf:name=chocrecipe.pdf"

Find all pages that display the children of a given page

To find all occurrences of the Children macro that display the children of a page called “Chocolate home”:

wikiMarkup:"children:page=chocolate home"

Find all occurrences of all macros referencing a specific parameter

You can find all macros that have a specific parameter and a specific value assigned to that parameter. For example: If there are two macros, MacroA and MacroB, that have the parameter "page=My page name", then your search will pick up all pages that contain either MacroA or MacroB.

This search query will pick up all Children macros, just like the previous example. It will also pick up any other macros that have the same “page” parameter.

wikiMarkup:"page=chocolate home"

Screenshots

Searching for all occurrences of the Include and Excerpt Include macros that include content from a page called ‘Introduction to chocolate’:

Searching for all occurrences of the View PDF macro that displays the file ‘ChocRecipe.PDF’:

How does it work?

In order to understand how to use the enhanced search, it’s useful to know exactly what the plugin does.

The plugin converts the macro code from Confluence 4 storage format to Confluence 3 wiki markup, and then adds the result to the Confluence index, thus making it searchable.

Let’s take the Excerpt Include macro as an example. The plugin takes the following code:

<ac:macro ac:name="excerpt-include"><ac:parameter ac:name="nopanel">true</ac:parameter><ac:default-parameter>Introduction to chocolate</ac:default-parameter></ac:macro>

and converts it to this:

{excerpt-include:Introduction to chocolate|nopanel=true}

It then adds the above text to the Confluence index.
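To make the conversion concrete, here's a Python sketch that mimics the example above. It is not the plugin's code, and I don't know how the plugin performs the conversion internally. The namespace URI is a placeholder of my own, needed only so that the 'ac:' prefix parses, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI (assumption), used only to declare the 'ac' prefix.
AC = "http://example.invalid/ac"


def macro_to_wiki_markup(storage_xml):
    """Convert one <ac:macro> element in storage format to {name:default|key=value}."""
    # Wrap the fragment in a root element that declares the 'ac' prefix.
    root = ET.fromstring('<root xmlns:ac="{}">{}</root>'.format(AC, storage_xml))
    macro = root.find("{%s}macro" % AC)
    name = macro.get("{%s}name" % AC)
    default = macro.findtext("{%s}default-parameter" % AC)
    params = [
        "{}={}".format(p.get("{%s}name" % AC), p.text or "")
        for p in macro.findall("{%s}parameter" % AC)
    ]
    # The default (unnamed) parameter comes first, then the named parameters.
    parts = ([default] if default else []) + params
    return "{" + name + (":" + "|".join(parts) if parts else "") + "}"
```

Seeing the converted text helps explain why a search for "include:Introduction" also finds the Excerpt Include macro: the indexed text contains 'excerpt-include:Introduction to chocolate'.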

How to install the plugin

You can install the plugin via the Confluence plugin manager, just like any other plugin. You need Confluence System Administrator permissions to do this.

If your Confluence site is open to the Internet:

  1. In Confluence, choose “Browse” > “Confluence Admin” > “Plugins”.
  2. Choose “Install Plugins”.
  3. Type “macro indexer” in the search box on the “Install Plugins” tab, and choose “Search”.
  4. Click the plugin name “Confluence Macro Indexer” to see the plugin details.
  5. Choose “Install”.
  6. When the plugin is successfully installed, rebuild the Confluence index to ensure that all macros on existing pages are added to the index. See the guide to administering the Confluence content index.

If your Confluence site cannot access the Internet:

  1. Download the plugin JAR from the Atlassian Marketplace: https://marketplace.atlassian.com/plugins/com.atlassian.confluence.plugins.confluence-macro-indexer-plugin
  2. Save the JAR file somewhere on your local computer.
  3. In Confluence, choose “Browse” > “Confluence Admin” > “Plugins”.
  4. Choose “Install Plugins”.
  5. Choose “Upload Plugin”, browse to the saved JAR file, and upload it.
  6. When the plugin is successfully installed, rebuild the Confluence index to ensure that all macros on existing pages are added to the index. See the guide to administering the Confluence content index.

The Confluence Macro Indexer plugin installed

AODC day 1: Turning Search into Find

This week I’m at AODC 2010: The Australasian Online Documentation and Content conference. We’re in Darwin, in the “top end” of Australia. This post is my summary of one of the sessions at the conference. The post is derived from my notes taken during the presentation. All the credit goes to Matthew Ellison, the presenter. The mistakes and omissions are all my own.

Matthew Ellison presented the first session of the conference. He called it “Turning Search into Find”. Tony Self, conference organiser extraordinaire, performed the introduction: “Matthew is from the UK. I must apologise for that”! I guess this gives you an idea of the informal nature of this conference. 🙂

Matthew’s talk covered these topics:

  • Why search is important.
  • Why search doesn’t always find, and what the obstacles are.
  • Innovative search techniques that clever people are using on the web.
  • The top 10 factors that will make your search more effective. Sometimes we have control of or input into the choice of the search tool.
  • Some practical pointers towards implementing a good search.

News flash: Matthew can use his phone as a remote clicker to move through his presentation.

AODC day 1 - Matthew talking about "Turning Search into Find"

Why search is important

Matthew pointed out that search is not necessarily the best tool for finding information, but it’s the one that most people want to use. They’re accustomed to using it, from their frequent use of Google and other web searches. Gone are the days when people are accustomed to using the index at the back of a book. Matthew quoted a study where half the people were given a tool with an index at the back and half had the search only. Results showed that the people who used the index were much more effective in finding the information. But when asked, the ones who used the search were more satisfied with the tool.

Many help systems now don’t have an index or table of contents at all. So the search had better be good!

Difference between find and search

Many tools have changed the word “find” to the word “search”. Even Windows did this a while ago. As Matthew said, the difference between the two terms is interesting. It’s a pity we can’t guarantee that people will find the information any more, just that they can search for it!

Matthew asked us to name some problems we may find with search. We came up with these:

  • Too many hits.
  • Synonyms. Search works well if you use the right word. But if you use the wrong word, you don’t find the information.
  • Stop words. The search tool overrides your terms because it thinks your term will return too many results. This means sometimes you can’t find what you’re looking for.
  • Complex search parameters, such as quotes, AND, OR etc. These conventions should be common across all searches.
  • You can’t ask questions.

Innovation in search techniques

Now Matthew showed us some innovative approaches that may help to improve the situation.

Google Suggest

A new development has appeared in Google search over the last 18 months: “Google Suggest”. Matthew calls it predictive search. As you start to type your search term, Google predicts and suggests what you want.

Personally, Matthew finds this has more impact than Google Wave, even though Google made far less fuss about it.

As you type, predictive search suggests the most common keywords that people have used that match your term. Then you can select the term from the dropdown list.

This reminded Matthew of the old experience of using an index, but better: not only does it give the first match alphabetically, it also gives the most popular match.

In an even more recent development, Google Suggest also takes into account your own recent searches.
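A toy version of the idea, as I understood it from the talk, fits in a few lines of Python. This is only a sketch of prefix matching ranked by popularity. It says nothing about how Google actually does it, and the function name and sample data are invented.

```python
from collections import Counter


def suggest(prefix, past_queries, limit=3):
    """Suggest the most frequently used past queries that start with prefix."""
    counts = Counter(q.lower() for q in past_queries)
    matches = [q for q in counts if q.startswith(prefix.lower())]
    # Most popular first; ties are broken alphabetically for a stable order.
    matches.sort(key=lambda q: (-counts[q], q))
    return matches[:limit]
```

Taking your own recent searches into account could be as simple as adding them to past_queries more than once, so that they count for more.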

Matthew had some fun asking us to guess the Google search suggestions for some phrases. Some of them were:

  • “What is” yields “What is my IP address?”
  • “How much wo” yields “How much wood would a woodchuck chuck”
  • “I like to ta” yields “I like to tape my thumbs to my hands to find out what it’s like to be a dinosaur”

BBC iPlayer

Provides an auto-suggest for its search.

Confused.com

Offers a list of choices based on what you type in. There is some synonym matching too. For example, if you type “IT”, it offers a list of jobs starting with “Computer”.

Railsaver.co.uk and British Airways (BA.com)

The dropdown suggestions also give you results where the middle of the word or phrase matches your search term. This is useful where you don’t know the official name of the station or airport.

Back to Google search

Google search does this now too. For example, the term “and bec” will bring up “posh and becks”. Google will also offer you alternative spellings.

Need to balance lots of functionality with ease of use

Many searches require you to understand boolean parameters. You need to know the difference between AND and OR.

Two online bookshops have different ways of balancing ease of use with useful functionality in their searches.

Borders UK (alas, now out of business) had a search that allowed you to enter the title, author or ISBN. It used predictive technology. It also categorised the results into groups, showing a group of all the books that matched the search, and another group of all the people whose names matched.

Blackwells offers a very simple search and also a separate advanced search, where you can fill in a lot of detail.

Faceted search

Faceted search is an alternative to a table of contents. The search classifies information by specific characteristics (facets). People can select what they’re interested in and drill down, in any order, as opposed to a table of contents which presents the information in a specific structure.

Examples:

Matthew introduced the concept of the “scent of information”: If people can see that they’re getting nearer to the information that they want, they’re quite happy to keep combining facets to narrow down their search.

What turns “search” into “find”?

Matthew gave some hints about how to make a search as useful and effective as possible:

  • “Stop” words let you exclude specific words from the index. This is useful to reduce the number of irrelevant hits. On the other hand, it may cause problems, for example if you want to search for “sort by date” and the word “by” has been excluded.
  • More useful is the ability to exclude certain topics from the search. For example, it makes sense to exclude popup topics or context-sensitive topics from the search results.
  • The search results should include an extract from the destination page. This is called “synopses” or “context”.
  • Boolean search (using AND, OR and NOT) gives the user the power to increase or decrease the number of results returned. Interesting: Google uses an implied AND, whereas most help tools use an implied OR by default. Bear in mind that your users may be used to one or the other way of searching. For example:
    • Adobe AIR Help and WebHelp default to OR. Users can explicitly type AND or OR.
    • Same for MadCap WebHelp.
    • ComponentOne NetHelp defaults to OR and does not allow users to enter specific boolean terms.
    • Etc.
  • Phrase matching allows users to enter phrases in quotes.
  • Fuzzy matching — it would be great if the search knew a bit about linguistics and could offer related words. Google is really good at this sort of thing.
  • Faceted search and search filtering. A while ago, Microsoft had the concept of “Information Types”, but this never really came to anything. MadCap Flare’s WebHelp and DotNetHelp do support “concept keywords” and “search filters”.
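Two of the points above, stop words and the implied AND versus the implied OR, are easy to see in a toy search. The Python sketch below is my own illustration, not something from Matthew's slides. The stop word list, the function names and the sample topics are all made up.

```python
STOP_WORDS = {"by", "the", "a", "of"}  # illustrative list (assumption)


def tokenize(text, stop_words=STOP_WORDS):
    """Split text into lower-case words, dropping the stop words."""
    return [w for w in text.lower().split() if w not in stop_words]


def search(query, topics, mode="OR"):
    """Return the titles of topics that match the query with an implied AND or OR."""
    terms = tokenize(query)
    combine = all if mode == "AND" else any
    results = []
    for title, body in topics.items():
        words = set(tokenize(body))
        # A query made up only of stop words matches nothing.
        if terms and combine(t in words for t in terms):
            results.append(title)
    return results
```

With “sort by date” as the query, the word “by” never reaches the index, which is exactly the stop word problem mentioned above. The implied AND returns fewer topics than the implied OR, so the same query can feel very different from one tool to the next.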

The techniques we can use in user assistance

Here are some examples of the kind of faceting we could use in user assistance:

  • Role (administrator or user)
  • Work role (accounts or human resources)
  • Experience (beginner, advanced, etc)
  • What kind of information do you want? (Step by step, conceptual, etc.)
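Here's a small Python sketch of how facets like these might drive a search, assuming each help topic carries its facet values as metadata. The topic data, the facet names and the two functions are invented for illustration.

```python
def facet_filter(topics, **selected):
    """Keep the topics whose metadata matches every selected facet value."""
    return [
        t["title"]
        for t in topics
        if all(t.get(facet) == value for facet, value in selected.items())
    ]


def facet_counts(topics, facet):
    """Count the topics per value of one facet, to show next to each choice."""
    counts = {}
    for t in topics:
        counts[t[facet]] = counts.get(t[facet], 0) + 1
    return counts
```

Because the facets are passed as keyword arguments, people can combine them in any order, which is the main difference from a table of contents. The counts are one way to give people the “scent of information” as they narrow things down.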

Ranking, such as by number of occurrences of the key word, or by metadata.

Metadata is the key to flexible and effective search. So the search looks not only at the content, but also at other information that the author has added to the topic. This can help with synonym matching, ranking, etc. RoboHelp 8 has some great tools for adding search keywords manually and for auto-adding index keywords as metadata.

Predictive search is great. This reduces the number of keystrokes the user has to make. There’s no excuse for our help not to use auto-suggest. It provides a better “scent of information”.

Worth thinking about: Predictive search may have a negative aspect, in that it channels us all towards the same search and therefore maybe the same content. This could cut out other content that people may have found by entering less popular search terms.

Matthew’s presentation also contains references to ways of implementing predictive search. For example, Google custom search and technologies such as PredictAd. The latter works in a very similar way to Google Suggest. Matthew spoke to the PredictAd developers and they said there’s no reason it shouldn’t be used for user assistance or documentation.

Adobe Forums used to have an awesome predictive search. (Adobe Forums don’t use this technology any more.) They categorised the search results, similar to the way Blackwells do. It was powered by technology from Jive Software: Clearspace. Matt’s presentation contains a basic specification of how it works. I’m sure he’d send it to you if you’re interested.

During question time, Choco recommended that we look at eBay for a good example of faceted search.

My conclusion

This was a great presentation full of information, fun and interactivity. Thank you Matthew!

Update on 30 May 2010: Matthew’s slides for this presentation, “Turning Search into Find”, are now available for downloading from the Matthew Ellison Consulting web site.

What Confluence 2.9 and I have in common

I’m good at finding things. And so is Confluence 2.9 😉

A couple of days ago, doom and gloom broke out at home. My son had lost his wallet. He had spent half an hour of the precious morning rush-hour looking for it, but to no avail. I’m sure you can identify with the atmosphere that hung over the household.

Now, it just so happens that I have a knack for finding things. Are you like that too? I walked into my son’s room and started the usual questions.

“Where were you when you last saw it?” “At home.”

“Were you wearing those pants?” “Yes.” “Oh dear.”

While I was talking, I drifted around the room. In the middle of the word “dear”, I lifted a jacket from a chair and there was the wallet hiding underneath. My reward was a bit of disbelief and a somewhat reluctant lifting of the doom and gloom.

Confluence finds things too

At work, we’ve just released Confluence 2.9. One of its main features is a revamped search. My favourite bit is the author search. It’s fun, interesting, and can be a bit of an ego boost 😉 First you search for a specific word or phrase as usual, then you enter a person’s name to find out which of the found pages, comments or whatever, were contributed by that person.

On our documentation wiki, I searched for “confluence OR crowd OR fisheye OR crucible OR jira OR bamboo OR clover” (because those are the main Atlassian products we document) and then entered my own name in the “Who” box. 3,436 results. Not bad for a year’s work, huh.

Then I tried it for our two founders, Mike Cannon-Brookes and Scott Farquhar. Scott gets 55 results for “jira” and 27 for “confluence”. Mike gets 229 results for “confluence”:

At work, we talk with pride about “founder code”. That’s the code Scott and Mike wrote in the early days, which still exists in the products. I think we can talk about “founder docs” too!

Demo space and quick-start guide

Another 2.9 feature dear to my heart, as a technical writer, is the Demonstration Space. This is a sample set of pages that is included in the Confluence download. Two of us worked hard on it for this release, adding a quick-start tutorial and bringing the content up to date. The Design team hotted up the look and feel too.

Other technical writers will know how valuable such a quick-start guide and sample content can be, and also how time-consuming it is to find just the right balance of detail and depth.

The Demo Space is a work in progress. We’re tackling it as an “agile” project, publishing the new enhancements with each Confluence release.

Back to my tale about finding things

I found that wallet within thirty seconds. It’s happened before, that I stumble across something that someone else has lost, just a short time after starting the search. Perhaps this knack is thanks to a stubborn refusal to accept that something can disappear off the face of the earth. Or perhaps it’s just lack of imagination.

Anyway, I know where all those odd socks are. That’s one up on Confluence 😉
