pdftohtml is a utility that converts PDF files into HTML and XML formats. It is based on Xpdf, is open source, and is written in C++.
Usage: pdftohtml [options] <PDF-file> [<html-file> <xml-file>]
-f <int>      : first page to convert
-l <int>      : last page to convert
-q            : don’t print any messages or errors
-h            : print usage information
-help         : print usage information
-p            : exchange .pdf links by .html
-c            : generate complex document
-i            : ignore images
-noframes     : generate no frames
-stdout       : use standard output
-zoom <fp>    : zoom the pdf document (default 1.5)
-xml          : output for XML post-processing
-hidden       : output hidden text
-nomerge      : do not merge paragraphs
-enc <string> : output text encoding name
-dev <string> : output device name for Ghostscript (png16m, jpeg etc)
-v            : print copyright and version info
-opw <string> : owner password (for encrypted files)
-upw <string> : user password (for encrypted files)
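To make the option list above a bit more concrete, here is a minimal Python sketch that assembles a pdftohtml command line from a handful of those flags. The helper name build_pdftohtml_cmd, its parameters, and the file names are hypothetical, and passing an output base name after the PDF path is an assumption about how the tool is normally invoked. The sketch only builds the argument list; actually running it requires pdftohtml to be installed and on the PATH.

```python
import subprocess


def build_pdftohtml_cmd(pdf_path, out_base=None, first=None, last=None,
                        complex_doc=False, ignore_images=False,
                        noframes=False, zoom=None, xml=False):
    """Assemble an argument list using only flags from the usage text."""
    cmd = ["pdftohtml"]
    if first is not None:
        cmd += ["-f", str(first)]      # first page to convert
    if last is not None:
        cmd += ["-l", str(last)]       # last page to convert
    if complex_doc:
        cmd.append("-c")               # generate complex document
    if ignore_images:
        cmd.append("-i")               # ignore images
    if noframes:
        cmd.append("-noframes")        # generate no frames
    if zoom is not None:
        cmd += ["-zoom", str(zoom)]    # zoom factor (tool default is 1.5)
    if xml:
        cmd.append("-xml")             # output for XML post-processing
    cmd.append(pdf_path)
    if out_base is not None:
        cmd.append(out_base)           # assumed: optional output name
    return cmd


def run_pdftohtml(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical file names; nothing is executed here, only printed.
    print(" ".join(build_pdftohtml_cmd("report.pdf", "report",
                                       first=1, last=3, xml=True)))
```

Building the arguments as a list rather than a single string avoids shell quoting problems when file names contain spaces.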
I have even used it to generate Excel from PDF, converting a 927 PDF file into an Excel document.
By the way, it runs on Windows, Linux, Mac OS X, and so on.
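The post does not say how the Excel document was produced, so the following is only one possible route, sketched under stated assumptions: run pdftohtml with -xml, group the text fragments into rows by their vertical position, and write a CSV file that Excel can open. The sample XML is made up, and its layout (page elements containing text elements with top and left attributes) is an assumption about what the -xml output looks like. The function names and the tolerance value are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up sample in the assumed shape of `pdftohtml -xml` output.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="50" width="40" height="12" font="0">Name</text>
<text top="100" left="200" width="30" height="12" font="0">Qty</text>
<text top="120" left="50" width="40" height="12" font="0">Bolt</text>
<text top="121" left="200" width="30" height="12" font="0">12</text>
</page>
</pdf2xml>"""


def xml_to_rows(xml_text, tolerance=3):
    """Group text fragments into rows when their 'top' values are close."""
    root = ET.fromstring(xml_text)
    rows = []
    for page in root.iter("page"):
        items = []
        for t in page.iter("text"):
            content = "".join(t.itertext()).strip()
            if content:
                items.append((int(t.get("top")), int(t.get("left")), content))
        items.sort()  # top to bottom, then left to right
        current, current_top = [], None
        for top, left, content in items:
            # Start a new row when the fragment sits clearly lower.
            if current_top is None or top - current_top > tolerance:
                if current:
                    rows.append([c for _, c in sorted(current)])
                current, current_top = [], top
            current.append((left, content))
        if current:
            rows.append([c for _, c in sorted(current)])
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text, which Excel can import."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(xml_to_rows(SAMPLE_XML)), end="")
```

This heuristic only recovers rows; it does not align cells into consistent columns, so real tables with empty cells or wrapped text would need extra handling based on the left values.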
pdftohtml DOES NOT do a good job of converting the tables in the source PDF document. I think that if one wants to judge how good a PDF converter is, it comes down to how many different PDF elements it can process. I guess most of them don’t do that and just extract the text (without the styles, relative positions, etc.), which is easy to do!
Sorry, I do not know what you mean. I am just saying that pdftohtml helps me do some jobs, such as converting PDF to HTML, XML, and even Excel.
Of course, even Acrobat Professional 8 cannot convert PDF to HTML or DOC very well.
I hadn’t used the -c option, without which it was converting only to text. With this option, it converts to elements, so you get at least the look & feel of a table. I was to some extent correct in saying that it WAS NOT converting to HTML tables as such. Figures with boxes & arrows convert only to text in boxes or arrow labels, but that’s the most we can do in HTML for figures, I guess!