AODC Day 3: Converting to Structured Content
This week I’m at AODC 2010: The Australasian Online Documentation and Content conference. We’re in Darwin, in the “top end” of Australia. This post is my summary of one of the sessions at the conference. The post is derived from my notes taken during the presentation. All the credit goes to Dr Alan Burton, the presenter. Any mistakes or omissions are my own.
The first session on Friday was titled “Converting to Structured Content”, by Dr Alan Burton. Introducing his talk, Alan showed us some “scary” pictures of the world he lives in. Scary indeed: a wall of code. With a charming smile, he admitted to being a “hardcore geek”. From the moment Alan started speaking, we could see that he has a good dose of the AODC sense of humour. 🙂
The problem that Alan’s talk addressed is this: You’re told that you need to move your content to XML. You have loads and loads of unstructured content. It’s in FrameMaker, Word, other desktop publishing applications or, even more fun, it’s on paper.
Who does the conversion?
Alan discussed various options to consider when deciding who will convert the documents to a structured format:
- You can do it yourself, if you have the technical knowledge required.
- The company’s IT department could do it. Alan jokingly referred to the “IT Crowd” — are you happy for Moss to do your conversion? 😉
- You can outsource it to someone like Alan. (He acknowledges with another disarming smile that there’s no-one quite like him.)
- If you consider outsourcing the work to an overseas company, take into account that this can lead to difficulties when communicating requirements.
You need a sample of the documentation that’s to be converted. What’s more, it must be a representative sample. For example, check whether the documentation as a whole contains tables, images, special characters, mathematical symbols. If so, are they represented in the sample? For best results, you need to be given the full documentation set in the analysis phase.
Find out whether the documents comply with a template. If the answer is yes, ask how long the documentation has been around and how many authors have had a hand in it. What are the chances that it really does comply with the template?
Here’s Alan’s recommendation, assuming your source documents are in Word: Convert all the Word documents to RTF and then feed them through an analysis tool. He extracts a complete list of all styles that appear in the documents. If there is a large number of styles, you may have a problem. He is building a set of utilities that let him look at styles, special characters, symbols etc within the documents, without himself having to read each and every document. In this way, he builds up a picture of how easy or difficult it will be to convert the document.
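To make the style inventory idea concrete, here is a minimal sketch in Python. It is not Alan’s tool, which he did not show in detail. It assumes each RTF file carries a `\stylesheet` group, reads the style names defined there, and counts how many documents define each one. The function names are my own invention, and the parsing is deliberately naive: it ignores escaped braces, and it lists styles that are defined rather than styles that are actually used.

```python
import re
from collections import Counter

# Matches RTF control words such as \s1 or \snext0, plus the \* marker.
CONTROL_WORD = re.compile(r"\\[a-zA-Z]+-?\d* ?|\\\*")
STYLESHEET_OPEN = r"{\stylesheet"


def stylesheet_group(rtf: str) -> str:
    """Return the contents of the stylesheet group, or '' if there is none."""
    start = rtf.find(STYLESHEET_OPEN)
    if start == -1:
        return ""
    depth = 0
    for i in range(start, len(rtf)):
        if rtf[i] == "{":
            depth += 1
        elif rtf[i] == "}":
            depth -= 1
            if depth == 0:
                return rtf[start + len(STYLESHEET_OPEN):i]
    return ""  # unbalanced braces: give up rather than guess


def style_names(rtf: str) -> list[str]:
    """List the style names defined in one RTF document (naive parse)."""
    names = []
    depth = 0
    entry_start = 0
    body = stylesheet_group(rtf)
    for i, ch in enumerate(body):
        if ch == "{":
            if depth == 0:
                entry_start = i + 1
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                entry = body[entry_start:i]
                # Drop the control words; what remains is "Style Name;".
                name = CONTROL_WORD.sub("", entry).strip().rstrip(";").strip()
                if name:
                    names.append(name)
    return names


def style_inventory(documents: dict[str, str]) -> Counter:
    """Count, for each style name, how many documents define it."""
    counts = Counter()
    for rtf in documents.values():
        counts.update(set(style_names(rtf)))
    return counts
```

A long tail of styles that appear in only one or two documents is the kind of warning sign Alan described: it suggests the authors drifted away from the template.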
Ask questions such as: “I see that you use italics to represent words that appear in the glossary. Do you use italics for any other purpose, such as titles or Latin words?”
Do the analysis before you choose or create your DTD, and before you buy any software.
Document the results of your analysis in detail.
After the analysis phase, you know what you have, but you still need to find out what you actually want. Things to consider:
- Indexing for search
- Online and/or paper output
Alan creates a “markup specification”, which is a detailed requirement specification for the conversion. You need to add examples for each requirement. This documentation provides the input into the design of your DTD.
Alan creates a document that maps the styles of the source document to the XML elements and attributes. The developers will use this document to create or customise the software that does the conversion.
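As an illustration of what such a mapping might look like once it reaches the developers, here is a small Python sketch. The style names, element names and attributes are all hypothetical, and the input is assumed to be a list of (style, text) pairs already extracted from the source document. The point is only that the mapping is data, and that any style missing from it should be reported rather than silently dropped.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: source style -> (XML element, attributes).
STYLE_MAP = {
    "Heading 1": ("title", {}),
    "Body Text": ("p", {}),
    "Warning": ("note", {"type": "warning"}),
}


def convert(paragraphs):
    """Turn (style, text) pairs into XML; return the tree and unmapped styles."""
    root = ET.Element("section")
    unmapped = []
    for style, text in paragraphs:
        if style not in STYLE_MAP:
            unmapped.append(style)  # needs a decision, not a guess
            continue
        tag, attrs = STYLE_MAP[style]
        element = ET.SubElement(root, tag, dict(attrs))
        element.text = text
    return root, unmapped


root, unmapped = convert([
    ("Heading 1", "Installing the widget"),
    ("Body Text", "Unpack the box."),
    ("Mystery Style", "Who wrote this?"),
])
print(ET.tostring(root, encoding="unicode"))
print("Unmapped styles:", unmapped)
```

The list of unmapped styles feeds straight back into the analysis: each one needs either a new row in the mapping or a question for the content owners.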
Checking the results of the conversion
When you receive the converted documentation, you need to check it to see if it’s actually what you want.
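Some of this checking can be scripted. The sketch below is my own illustration, not something Alan presented. It assumes you have the plain text of the source document and the converted XML as strings, and it checks two things: that the XML is well-formed, and that the words in the output match the words in the source. It does not validate against a DTD, and an inline element in the middle of a word would show up as a false alarm.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def check_conversion(source_text: str, xml_text: str) -> list[str]:
    """Return a list of problems found; an empty list means none were detected."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed XML: {err}"]

    # Compare the words in the output with the words in the source.
    converted = Counter(" ".join(root.itertext()).split())
    source = Counter(source_text.split())

    problems = []
    for word, n in (source - converted).items():
        problems.append(f"missing from output: {word!r} x{n}")
    for word, n in (converted - source).items():
        problems.append(f"unexpected in output: {word!r} x{n}")
    return problems
```

A check like this only tells you that the text survived. Whether the structure is what you asked for still needs a human eye and the markup specification.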
These are some of the problem areas that Alan mentioned:
- Graphics — format, naming conventions, storage
- Forms — these are very hard to recreate in XML
- Mathematical and scientific symbols and formulae
FrameMaker has a Migration Guide that tells you how to convert from unstructured FrameMaker to structured FrameMaker content. This is a useful tool. You can then export to XML. You do need to do some cleanup afterwards, either manually or with scripting.
You could convert from Word to FrameMaker, then to Structured FrameMaker.
Converting from Word to XML, you usually convert from Word to RTF and then to XML. Alan has done this once, in a very controlled environment and with just one document. It can become fairly complicated.
Converting to DITA is very challenging. You will probably need to split input files into different output files. It is unlikely that you can automate the insertion of metadata. You may need to rename graphics and create multiple graphics formats.
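To give a feel for the splitting step, here is a rough Python sketch of my own. It assumes the input has already been reduced to (style, text) pairs and that every “Heading 1” paragraph starts a new topic; both the heading style and the file naming scheme are hypothetical. Real content would also need duplicate titles handled, since two identical headings here would overwrite each other.

```python
import re


def slugify(title: str) -> str:
    """Make a file-name-friendly version of a heading."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def split_into_topics(paragraphs, heading_style="Heading 1"):
    """Group paragraphs into topics, one per heading.

    Returns (topics, orphans): topics maps a hypothetical output file name
    to its title and body paragraphs; orphans holds any text that appeared
    before the first heading and so has no obvious home.
    """
    topics = {}
    orphans = []
    current = None
    for style, text in paragraphs:
        if style == heading_style:
            current = f"{slugify(text)}.dita"
            topics[current] = {"title": text, "body": []}
        elif current is None:
            orphans.append(text)
        else:
            topics[current]["body"].append(text)
    return topics, orphans
```

Even in this toy form, the sketch shows why the metadata is hard to automate: nothing in the source tells the script what type of topic each chunk should become.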
You will never end up with a clean result immediately, no matter what any tool may claim. You will always need to do some manual cleanup, and you will probably need to get a programmer to do some scripting.
This was a good, fast-paced walk through some scenarios and guidelines for converting from unstructured to structured content. I could tell that Alan has a lot more information to share. Thank you for a great session, Alan.