Extract and Analyze Text Data of PDF Documents with PDFMiner

PDFMiner is a suite of programs for extracting and analyzing text data from PDF documents. Unlike other PDF-related tools, it lets you obtain the exact location of text on a page, as well as extra information such as fonts and ruled lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It also has an extensible PDF parser that can be used for purposes other than text analysis.
The features of PDFMiner are as follows:

  • Written entirely in Python (version 2.4 or newer).
  • PDF-1.7 specification support (well, almost).
  • Non-ASCII languages and vertical writing scripts support.
  • Various font types (Type1, TrueType, Type3, and CID) support.
  • Basic encryption (RC4) support.
  • PDF to HTML conversion (with a sample converter web app).
  • Outline (TOC) extraction.
  • Tagged contents extraction.
  • Inference of text flow (reading order) using a clustering technique.
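
To give a feel for what location-based grouping means, here is a minimal sketch in Python. It is not PDFMiner's actual algorithm: the `Chunk` type, the `group_into_lines` function, and the `tolerance` parameter are all hypothetical names made up for illustration. It assumes each text chunk carries a bounding box in PDF coordinates (origin at the bottom-left, y growing upward) and groups chunks into lines when their vertical centers are close.

```python
from typing import List, NamedTuple


class Chunk(NamedTuple):
    """A piece of text with its bounding box (hypothetical type, for illustration)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def group_into_lines(chunks: List[Chunk], tolerance: float = 2.0) -> List[str]:
    """Group chunks into lines by vertical proximity, top of the page first."""
    def center(c: Chunk) -> float:
        return (c.y0 + c.y1) / 2

    lines: List[List[Chunk]] = []
    # Visit chunks from the top of the page downward (larger y is higher up).
    for chunk in sorted(chunks, key=lambda c: -center(c)):
        if lines:
            current = lines[-1]
            mean_center = sum(center(c) for c in current) / len(current)
            # Same line if the chunk sits close to the line's average center.
            if abs(center(chunk) - mean_center) <= tolerance:
                current.append(chunk)
                continue
        lines.append([chunk])

    # Within each line, order chunks left to right and join them with spaces.
    return [
        " ".join(c.text for c in sorted(line, key=lambda c: c.x0))
        for line in lines
    ]
```

A real layout analyzer also has to deal with columns, vertical writing, and varying font sizes, which this toy version ignores.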

PDFMiner comes with two handy tools: pdf2txt.py and dumppdf.py.
pdf2txt.py extracts text contents from a PDF file. It extracts all text that is rendered programmatically; it cannot recognize text drawn as images, which would require optical character recognition. It also extracts the corresponding location, font name, font size, and writing direction (horizontal or vertical) for each text portion. You need to provide a password for protected PDF documents whose access is restricted. You cannot extract any text from a PDF document that does not have extraction permission.

For non-ASCII languages, you can specify the output encoding (such as UTF-8).
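
As a rough sketch of how the password and encoding options fit together, the Python helper below assembles a pdf2txt.py command line. The function names `build_pdf2txt_command` and `run_pdf2txt` are hypothetical. The flags (`-t` for output type, `-c` for codec, `-P` for password, `-o` for output file) follow the tool's usage text as I understand it, so check them against your installed version. The sketch also assumes pdf2txt.py is on your PATH.

```python
import subprocess


def build_pdf2txt_command(pdf_path, output_path=None, password=None,
                          codec="utf-8", output_type="text"):
    """Return the argument list for a pdf2txt.py run (nothing is executed here)."""
    # -t selects the output format, -c the output encoding.
    argv = ["pdf2txt.py", "-t", output_type, "-c", codec]
    if password:
        # -P supplies the password for a protected document.
        argv += ["-P", password]
    if output_path:
        # -o writes to a file instead of standard output.
        argv += ["-o", output_path]
    argv.append(pdf_path)
    return argv


def run_pdf2txt(pdf_path, **options):
    """Run pdf2txt.py and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_pdf2txt_command(pdf_path, **options), check=True)
```

For example, `run_pdf2txt("report.pdf", output_path="report.txt", password="secret")` would try to write the UTF-8 text of a protected file to report.txt. The file names and password here are placeholders.
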
dumppdf.py dumps the internal contents of a PDF file in a pseudo-XML format. This program is primarily for debugging purposes, but it can also be used to extract some meaningful contents (such as images).
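
A similar sketch for dumppdf.py is shown below. The helper name `build_dumppdf_command` is hypothetical. The flags are the ones I recall from the tool's usage text: `-a` dumps all objects, `-T` dumps the outline, `-i` selects object IDs, `-r` writes a raw stream, and `-P` gives a password. Treat them as assumptions to verify against your installed version.

```python
def build_dumppdf_command(pdf_path, dump_all=False, outline=False,
                          object_ids=None, raw_stream=False, password=None):
    """Return the argument list for a dumppdf.py run (nothing is executed here)."""
    argv = ["dumppdf.py"]
    if dump_all:
        argv.append("-a")   # dump every object in the file
    if outline:
        argv.append("-T")   # dump the outline (table of contents)
    if raw_stream:
        argv.append("-r")   # write stream data as-is, e.g. an embedded image
    if object_ids:
        # -i takes a comma-separated list of object IDs.
        argv += ["-i", ",".join(str(i) for i in object_ids)]
    if password:
        argv += ["-P", password]
    argv.append(pdf_path)
    return argv
```

Passing the resulting list to `subprocess.run` and redirecting standard output to a file is one way to save an extracted stream. Which object ID holds an image depends on the document.
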
To download the latest version of PDFMiner or for more details, please visit http://www.unixuser.org/~euske/python/pdfminer/index.html
