PDF to Word conversion
David Stevenson at 07:53 GMT on 25 March 2010
The popularity of PDF as a means for sharing information via email or the web has had an unintended consequence – the orphan PDF. PDF is a publishing, not an authoring format. If you want to change the content of a PDF, you go back to the source document – its parent if you will. But what if you can’t find the parent? Or don’t trust the source?
For this reason many organisations follow strict procedures to keep track of document versions, or implement systems that do this automatically. Such systems often integrate with popular authoring tools such as Microsoft Word and act as a front-end to a document management system (DMS). They systematically store iterations of the document and keep track of versions, so that whenever information is published to PDF its source is known.
Even with systems to help, a PDF rendition of a document often ends up being considered the “master”, for all sorts of perfectly valid reasons. While PDF tools, including our own, can make minor text corrections (fixing a typo, for example), the nature of PDF precludes large-scale changes. For that you need to convert the PDF back to an authoring format. It turns out that this is a very common problem, and thousands of Google searches on “convert PDF to Word” are carried out every day.
While tools such as ours provide an “Export to Word” feature that does this job at the click of a mouse, the internal process of converting a PDF to Word is much harder than one might expect, and difficult to do well. Open a Word document in Page Layout view side-by-side with a PDF of the same document and you could be forgiven for thinking they are the same thing, but in fact the PDF page has a very different internal structure. PDF and XPS are fixed formats; they represent what an authoring application would print. Indeed, most PDFs are generated by interpreting the data a printer uses to put marks on paper and storing it in a form that retains all the graphics, layout and images of a printed page. By contrast, an authoring format such as Word’s DOC or DOCX is a “flow” format – the information is essentially a flow of text and graphics that is formatted to fit a page according to rules of layout. Change the width of the margins, for example, and the text will reflow to suit.
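To make the distinction concrete, here is a minimal Python sketch. It is only an illustration, not how Word or any PDF tool works internally: the `layout` and `to_fixed` helpers, the character-based line width and the coordinate values are all assumptions made for the example. The flow text reflows when the width changes, whereas the fixed version is just a list of positioned marks with no memory of the rules that produced it.

```python
import textwrap

# Flow format: the content is a single run of text. Line breaks are not
# stored; they are computed from a layout rule (here, a width in characters).
flow_text = "The quick brown fox jumps over the lazy dog and keeps running."


def layout(text, width):
    """Break flowing text into lines that fit the given width (hypothetical helper)."""
    return textwrap.wrap(text, width=width)


def to_fixed(lines, left=72.0, top=720.0, leading=14.0):
    """Freeze laid-out lines into (x, y, text) marks, roughly as a printer sees them.

    The coordinates are illustrative values in points, not taken from any real file.
    """
    return [(left, top - i * leading, line) for i, line in enumerate(lines)]


wide = layout(flow_text, 40)    # wide text area: fewer, longer lines
narrow = layout(flow_text, 20)  # narrower text area: the same text reflows
fixed = to_fixed(wide)          # positioned marks only; the layout rule is gone

for x, y, line in fixed:
    print(f"({x}, {y}) {line}")
```

Changing the width passed to `layout` changes the line breaks, but nothing comparable can be done with `fixed` without first working out which marks belong together.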
To convert a fixed format back to a flow format, the program has to apply the same rules we learned as children about reading a page. In the case of Western documents, this means starting at the top left of the page, working towards the bottom right, following text columns as required, then moving on to the next page and so on. It has to follow rules about how sentences and paragraphs are formed, and recognise that text may continue after an intervening picture or diagram. It then has to encode this information in a manner that will, as far as possible, reproduce the same layout in the authoring format: page size, margins, text, images, font attributes and much more.
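The reading rules described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the algorithm used by our tool or any other: the `TextLine` type, the `reading_order` function and its thresholds are hypothetical, each input item is assumed to be a whole line of text, and `y` is assumed to be measured downwards from the top of the page.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    x: float   # left edge of the line, in points from the left of the page
    y: float   # position in points from the TOP of the page (an assumption)
    text: str


def reading_order(lines, column_gap=100.0, para_gap=20.0):
    """Return paragraphs in reading order: columns left to right, then top to bottom.

    column_gap and para_gap are illustrative thresholds, not values from a real tool.
    """
    # 1. Group lines into columns: a large jump in left edge starts a new column.
    columns = []
    for line in sorted(lines, key=lambda l: l.x):
        if columns and line.x - columns[-1][-1].x <= column_gap:
            columns[-1].append(line)
        else:
            columns.append([line])

    # 2. Within each column read top to bottom; a large vertical gap
    #    between consecutive lines starts a new paragraph.
    paragraphs = []
    for column in columns:
        column.sort(key=lambda l: l.y)
        current = [column[0].text]
        for prev, line in zip(column, column[1:]):
            if line.y - prev.y > para_gap:
                paragraphs.append(" ".join(current))
                current = []
            current.append(line.text)
        paragraphs.append(" ".join(current))
    return paragraphs


# Marks on a page can arrive in any order; this page has two columns.
page = [
    TextLine(320, 100, "moving on to"),
    TextLine(72, 100, "Start at the"),
    TextLine(72, 114, "top left."),
    TextLine(72, 150, "Then follow"),
    TextLine(72, 164, "each column,"),
    TextLine(320, 114, "the next page."),
]

for paragraph in reading_order(page):
    print(paragraph)
```

Even this toy version has to guess at thresholds, and it ignores paragraphs that continue from one column to the next, text that resumes after a picture, fonts, margins and right-to-left scripts. That gives some idea of why real conversion is hard to do well.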
Results vary. PDF documents that convert well to Word are those with straightforward layout; more often than not they started life in Word or another word processor. Complex layouts tend to be more of a challenge; with these, there is often a choice (as there is with our tool) to place an emphasis during conversion on maintaining the layout and appearance of the page at the expense of easy text editing, or vice versa.
So it makes sense to check that a tool you are considering for this task is up to the job. Ours is available with a free trial – just go to http://www.globalgraphics.com/gdoc/free-creator and follow the link to download.



