Mark D. Anderson (mda@discerning.com)
created: October 10 2005
updated: March 27 2007
Introduction
Being able to convert PDF files to some sort of XML would have all sorts of uses:
- There are a gazillion tools for slicing and dicing XML.
- You can read XML in a text editor.
- It allows for an approach to PDF modification that doesn’t entail learning somebody’s PDF API.
- It can become the basis for interconversion among document formats.
Before going any further, let me clarify something about PDF, because I’ve gotten
multiple queries from people on the net who have stumbled onto this page, who seem
to be under the impression that PDF has useful semantic information in it.
In general it doesn’t. There is the ability to embed actual XML (see XMP
at http://www.adobe.com/products/xmp/standards.html ). And some PDF authoring
programs support “tagged PDF”. Tagged PDF is the moral equivalent of the “alt”
attribute in HTML: it indicates the logical structure of the pages for alternate reader
software, for accessibility purposes. This structure uses a vocabulary unique
to PDF (“Art” for article, “Sect” for section, etc.). It is not stored in the
PDF format as XML (unlike XMP metadata), but as a content stream.
But a random PDF file quite likely has none of this.
Instead it just has the equivalent of dumbed-down postscript.
That means that even the concept of a word is not always apparent, given
that some PDF generation software will absolutely position individual characters.
Reconstructing a paragraph would require heuristics to guess where columns
are, and then undoing any hyphenation that was done.
For an excellent
article on that challenge (written by some engineers at a small firm acquired by Apple), see
http://www.idealliance.org/papers/dx_xml03/papers/05-03-03/05-03-03.html
Also see Tamir Hassan’s project report at http://www.dbai.tuwien.ac.at/staff/hassan/pdf2html/final.pdf
as well as his other publications at www.tamirhassan.com
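To make the dehyphenation step concrete, here is a toy sketch of the kind of heuristic involved; the class and its join rule are my own illustration, not something any of the tools discussed here implements:

    // Toy dehyphenation: join a line ending in a hyphen with the next line,
    // on the guess that the typesetter inserted the hyphen. A real tool needs
    // a dictionary to tell soft hyphens from genuine compounds ("well-known"),
    // and needs column detection to have run first so lines come in order.
    public final class Dehyphenate {
        public static String joinLines(String[] lines) {
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < lines.length; i++) {
                String line = lines[i].trim();
                if (line.endsWith("-")) {
                    // drop the trailing hyphen and glue directly to the next line
                    out.append(line.substring(0, line.length() - 1));
                } else {
                    out.append(line).append(' ');
                }
            }
            return out.toString().trim();
        }

        public static void main(String[] args) {
            String[] lines = { "Reconstructing a para-", "graph requires guess-", "work." };
            System.out.println(joinLines(lines)); // Reconstructing a paragraph requires guesswork.
        }
    }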
If you happen to get a “tagged PDF” and in fact it has tagged all the content
(which is difficult to verify) then you are home free. But don’t assume you’ll find that.
Free Solutions
Given how obviously useful this would be, you might think that there would be
a whole bunch of free tools
for doing this. Well, there aren’t any. Zippo. There are at least a half-dozen
active java-based PDF library projects, and none of them support this.
There are about as many free libraries in all other programming languages combined, and
they don’t do it either.
To be clear, you can do the following with free software:
- extract just the images, albeit with no positioning information. For example, pdfimages from
  xpdf will do this.
- extract just the text, with no images. Some of the better tools have multiple modes
  for text extraction; one that just pulls it out as it appears, and another that tries
  to position it like the PDF file intended, assuming a fixed-width font. For example:
  - pdftotext (also from xpdf) will extract text with or without layout. Unfortunately,
    when text is extracted without layout, it goofs up on multi-column text and is unreadable.
    When extracted with layout, it is readable to a human, but it would still be necessary to
    write your own heuristic software to extract continuous text.
  - multivalent (from multivalent.sf.net) also has multiple text extraction modes in its
    ExtractText tool. However, in one mode you end up with individually positioned words,
    and in the other mode you do get whole columns, but the paragraphs run together.
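To make the text side concrete, here is a minimal sketch of plain text extraction with pdfbox (the library hacked on below); it uses the old org.pdfbox package names from that era’s pdfbox.org CVS tree (later Apache releases moved under org.apache.pdfbox), so treat the exact class paths and load() signature as assumptions:

    import java.io.IOException;

    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.util.PDFTextStripper;

    // Dump the plain text of a PDF. What comes out is exactly the
    // undifferentiated stream described above: no columns, no paragraphs,
    // no images, just characters in drawing order.
    public class DumpText {
        public static void main(String[] args) throws IOException {
            PDDocument doc = PDDocument.load(args[0]);
            try {
                System.out.println(new PDFTextStripper().getText(doc));
            } finally {
                doc.close();
            }
        }
    }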
Beyond that, the pickings are meager.
There is pdftohtml.sourceforge.net, untouched since 2003. It is a C++ GPL tool using xpdf
(an old 2.x version).
It extracts text, and puts all the images in each page
into a single background image. It doesn’t extract anything else.
The generated html uses absolute positioning on a line-by-line basis (when the -c option is used).
There is Tamir Hassan’s pdf2html at http://www.dbai.tuwien.ac.at/staff/hassan/pdf2html/index.html
It relies on an obsolete version of JPedal. It does not extract images, does not support recent PDF format versions, and
it messed up the character sets on the files I tried.
There is pdftoxml.sourceforge.net, also untouched since 2003. It is an LGPL java tool, using an old snapshot of JPedal.
It has zero documentation, and no released files, but it does have CVS.
It built fine using ant, but barfed with a NumberFormatException on the first pdf file I handed to it.
From browsing the source code, it looks like it iterates through the Page list
and for each page dumps Lines, Polygons, and Boxes.
pstoedit has a plugin with the ability to convert PDF to SVG.
This plugin is closed source shareware (though the base pstoedit is open source); available for Windows or Linux.
Useful certainly, but it doesn’t extract everything, and since no source is available, it
is hard to do more with it.
Commercial Solutions
On the commercial side,
Acrobat 6.0 has a “Save As XML”. But it is a joke, saving even less
information than Acrobat’s “Save as HTML”, which is itself quite poor,
doing worse than even the free tools like pdftohtml.
The various Acrobat “Save As XML” and “Save as HTML” options basically save a collection of images.
Acrobat does seem to do a better job than Apple Preview at saving a page as an image,
if that is what you want. ImageMagick “convert” is also good at converting a pdf page to a single image.
Adobe InDesign 2.0 does have some limited support for XML (basically, importing/exporting PDF tags).
Adobe hosts an online conversion service at http://www.adobe.com/products/acrobat/access_onlinetools.html .
There are several third-party commercial tools such as
these that convert PDF to various XML vocabularies:
- www.pdftransformer.com ABBYY PDF Transformer
- www.cambridgedocs.com xDoc Converter (to XHTML, XSL-FO, or XML)
- www.bcltechnologies.com Magellan (converts multiple formats to HTML)
- snowtide.com PDFTextStream. See also their online service at pdftextonline.com .
- www.texcel.com FormBridge
- www.mattercast.com SVG Imprint
- www.pdftron.com PDF2SVG
- www.exegenix.com ECS (seems to be a consulting technology only)
I don’t know how well they do except for the ABBYY product — and I can attest to it doing a decent job.
It will stitch together columns, and will undo hyphenation. They leverage their OCR engine to recognize
columns. It is only available for Windows, and does not appear to be scriptable in a server application.
They do separately sell an SDK which runs on unix-like systems (abbyy.com/sdk/)
but I don’t know what pricing is like.
There are other commercial tools that go from some XML vocabulary to PDF:
- www.princexml.com
- www.aspose.com
- www.reportlab.com (RML2PDF and PageCatcher are not open source)
- www.fytek.com
What I did
I grabbed the latest CVS of pdfbox.org and hacked up COSWriter.java.
I did not modify any existing files; I only added new ones:
- org/pdfbox/pdfwriter/COSWriterVisitor.java – pull out the generic stuff that deals with encryption and the crufty Visitor interface
- org/pdfbox/pdfwriter/COSWriterAbstract.java – an abstract class for exporting PDF documents
- org/pdfbox/pdfwriter/COSWriterXML.java – an implementation to export to XML
- org/pdfbox/pdfwriter/COSWriterPDF.java – a reimplementation of COSWriter.java; should behave identically.
- org/pdfbox/AsXML.java – the command line utility
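To give a flavor of the split, here is a hypothetical fragment of what the XML implementation’s dictionary callback might look like. The visitFromDictionary name follows pdfbox’s ICOSVisitor interface, but the body is my own simplified illustration (write() standing in for an output helper on the abstract base class), not the actual code:

    // Hypothetical sketch, not the real COSWriterXML source: the generic
    // visitor machinery walks the COS object graph and calls one such
    // method per primitive type; this one emits a dictionary.
    public Object visitFromDictionary(COSDictionary obj) throws COSVisitorException {
        write("<dict>");
        for (Iterator i = obj.keyList().iterator(); i.hasNext();) {
            COSName key = (COSName) i.next();
            write("<dictentry name='" + key.getName() + "'>");
            obj.getItem(key).accept(this); // recurse back through the visitor
            write("</dictentry>");
        }
        write("</dict>");
        return null;
    }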
The result of running “java org.pdfbox.AsXML foo.pdf foo.xml” gives a foo.xml file that looks something like:
<document>
  <header> %PDF-1.4 </header>
  <body>
    <Dictionary key='1' gen='0'>
      <dict>
        <dictentry name='Pages'><ref key='2' gen='0'/></dictentry>
        <dictentry name='Type'><name>Catalog</name></dictentry>
        <dictentry name='PageLabels'><ref key='3' gen='0'/></dictentry>
        <dictentry name='Metadata'><ref key='4' gen='0'/></dictentry>
      </dict>
    </Dictionary>
    <Dictionary key='5' gen='0'>
      <dict>
        <dictentry name='ModDate'><string>D:20050910211227-04'00'</string></dictentry>
        <dictentry name='CreationDate'><string>D:20050910211227-04'00'</string></dictentry>
        <dictentry name='Title'><string>Unknown</string></dictentry>
        <dictentry name='Creator'><string>QuarkXPress: pictwpstops filter 1.0</string></dictentry>
        <dictentry name='Producer'><string>Acrobat Distiller 6.0.1 for Macintosh</string></dictentry>
        ...
This is really just a proof of concept. There are a whole bunch of problems:
- There is no escaping of xml characters such as <>& in strings.
- There is no special treatment of contents which is actually XML, such as XMP.
- There is no handling of variable character encoding: raw bytes are always just written out directly,
  and there is no declared character encoding at the XML document level.
- There is no DTD.
- There is no reverse parser implemented to roundtrip from XML back to PDF.
- There is no optimization to remove useless levels of object references.
- There is an extra nesting level (like the <dict> inside each <Dictionary> above) in order
  to be consistent about which objects can have id and generation numbers.
- Dictionaries that have a “Type” or “Subtype” key should probably be inverted so that the
  particular object is made the xml container (“Catalog”, “Page”, “Font”, “Image”, etc.).
- The array representation is quite verbose, particularly for arrays of integers.
- It puts out xref entries with offsets of -1, since such information is pretty irrelevant in an XML output.
- checkstyle has a fair number of complaints.
- The low-tech XML indenting isn’t very good.
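The escaping problem at least is mechanical; a minimal generic helper (my own, not part of pdfbox) for element content would look like:

    // Escape the three characters that matter in XML element content.
    // Attribute values would additionally need quote characters handled.
    public final class XmlEscape {
        public static String escape(String s) {
            StringBuilder out = new StringBuilder(s.length());
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                switch (c) {
                    case '<': out.append("&lt;"); break;
                    case '>': out.append("&gt;"); break;
                    case '&': out.append("&amp;"); break;
                    default:  out.append(c);
                }
            }
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(escape("D:2005 <odd> & risky")); // D:2005 &lt;odd&gt; &amp; risky
        }
    }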
Some of these problems could be addressed by an XML transformation after the fact.
The ones having to do with encodings are pretty crippling though.
Also, for all I know this might not have been the right interception point in the bowels of pdfbox.
For example, a trace function could be introduced at parsing time to do this on the fly.
Also, this could be done instead by directly using the XML SAX2 APIs (either
at parse or at save time).
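For the save-time route, the usual JAXP trick is to drive a TransformerHandler as a SAX2 event sink, which provides escaping and serialization for free. A minimal sketch, with element names borrowed from the dump above and everything else standard JAXP:

    import java.io.FileOutputStream;

    import javax.xml.transform.sax.SAXTransformerFactory;
    import javax.xml.transform.sax.TransformerHandler;
    import javax.xml.transform.stream.StreamResult;

    import org.xml.sax.helpers.AttributesImpl;

    // Emit XML through SAX2 events instead of hand-built strings; the
    // TransformerHandler serializes and escapes as the events arrive.
    public class SaxEmit {
        public static void main(String[] args) throws Exception {
            SAXTransformerFactory f =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler h = f.newTransformerHandler();
            h.setResult(new StreamResult(new FileOutputStream("out.xml")));

            h.startDocument();
            h.startElement("", "document", "document", new AttributesImpl());

            AttributesImpl atts = new AttributesImpl();
            atts.addAttribute("", "key", "key", "CDATA", "1");
            atts.addAttribute("", "gen", "gen", "CDATA", "0");
            h.startElement("", "Dictionary", "Dictionary", atts);
            String s = "strings with <>& are escaped for free";
            h.characters(s.toCharArray(), 0, s.length());
            h.endElement("", "Dictionary", "Dictionary");

            h.endElement("", "document", "document");
            h.endDocument();
        }
    }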
But the main problem is that this particular XML export is too low level for most purposes.
It is useful for some things (and it certainly ensures roundtripping with perfect fidelity),
but most of the time you don’t want a dump at the level of the primitive
PDF atoms like Array, Dictionary, Float, String, etc.
Rather, what you want are objects like Text, Image, Font, Script, Page, Graphic, etc.
You also want:
- The ability to have particular kinds of objects (Image or Font, for example) be
  exported into external files rather than inlined, or suppressed entirely.
- The ability to control any conversions applied to Font or Image representations.
- The ability to control the extent to which text is reconstructed: as lines, as paragraphs, or as article “beads”.
- The ability to control any automatic “zoning” that is attempted.
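By way of illustration only, a higher-level export along those lines might look something like this; the vocabulary is invented for this example, not any tool’s actual output:

    <page number='1'>
      <font id='F1' name='Times-Roman' size='10'/>
      <image src='page1-img1.png' x='72' y='500' width='200' height='120'/>
      <paragraph font='F1'>Reconstructed text, with columns stitched together
        and hyphenation undone.</paragraph>
    </page>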
At the moment I’m not planning on pushing this much further.
I was just scratching an itch…
-mda
P.S.: there is a bug in COSWriter.java where one of these tests should be on name, not on value:
if (value != null) {
    // this is purely defensive, if entry is set to null instead of removed
    if( value != null ) {
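My guess (and it is only a guess) is that the outer test is the one meant to be on the key, since the defensive comment refers to the value:

if (name != null) {
    // this is purely defensive, if entry is set to null instead of removed
    if( value != null ) {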