Three Open Source PDF Parsers Developed in Python

PDFMiner is a suite of programs for extracting and analyzing text data from PDF documents. Unlike other PDF-related tools, it can report the exact location of text on a page, along with extra information such as fonts and ruled lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML), and an extensible PDF parser that can be used for purposes other than text analysis.
For details, please visit "Extract and Analyze Text Data of PDF Documents with PDFMiner".
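
To make the point about text locations concrete, here is a minimal sketch. It assumes the maintained fork pdfminer.six is installed and uses its `extract_pages` helper and `LTTextContainer` layout class. The `TextBox` tuple and the `text_boxes` function are names invented for this illustration, not part of PDFMiner.

```python
from collections import namedtuple

# Hypothetical record type for this sketch: one block of text plus its position.
TextBox = namedtuple("TextBox", "page x0 y0 x1 y1 text")


def text_boxes(pdf_file):
    """Yield every text block in a PDF with its page number and bounding box.

    pdf_file may be a path or a binary file object. Coordinates are in PDF
    points, measured from the bottom-left corner of the page.
    """
    # Assumption: pdfminer.six is installed. It is imported here so that the
    # module still loads on machines where the package is missing.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page_layout in enumerate(extract_pages(pdf_file), start=1):
        for element in page_layout:
            # Skip figures, lines and other non-text layout objects.
            if isinstance(element, LTTextContainer):
                x0, y0, x1, y1 = element.bbox
                yield TextBox(page_number, x0, y0, x1, y1,
                              element.get_text().strip())
```

Calling `list(text_boxes("report.pdf"))` (the file name is a placeholder) would give one `TextBox` per block of text, which is enough to sort text by position or to pick out a region of the page.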

pdf-parser.py
This tool parses a PDF document to identify the fundamental elements used in the analyzed file. It does not render the document. The code of the parser is quick-and-dirty; I’m not recommending it as a textbook case for PDF parsers, but it gets the job done.
For details, please visit PDF Tools.
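
To give a rough idea of what "identifying the fundamental elements" without rendering can look like, here is a small standard-library sketch. It is not pdf-parser.py's code or its method: the function name `summarize_pdf` and the regular expressions are assumptions made for this illustration. It only scans the raw bytes, so it will miss objects stored inside compressed object streams and may miscount if binary stream data happens to contain the same keywords.

```python
import re
from collections import Counter

# Byte patterns for structural keywords of a PDF file (illustrative only).
OBJECT_HEADER = re.compile(rb"\b\d+\s+\d+\s+obj\b")   # e.g. "12 0 obj"
STREAM_END = re.compile(rb"\bendstream\b")
XREF = re.compile(rb"\bxref\b")                        # does not match "startxref"
TRAILER = re.compile(rb"\btrailer\b")
TYPE_ENTRY = re.compile(rb"/Type\s*/(\w+)")           # e.g. "/Type /Page"


def summarize_pdf(data):
    """Count the basic building blocks found in raw PDF bytes."""
    return {
        "objects": len(OBJECT_HEADER.findall(data)),
        "streams": len(STREAM_END.findall(data)),
        "xref_tables": len(XREF.findall(data)),
        "trailers": len(TRAILER.findall(data)),
        # How often each /Type value (Catalog, Pages, Page, ...) appears.
        "types": Counter(name.decode("ascii")
                         for name in TYPE_ENTRY.findall(data)),
    }
```

A typical call would read the file in binary mode first, for example `summarize_pdf(open(path, "rb").read())`, where `path` is whatever PDF you want to inspect.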

PyPDF

A Pure-Python library built as a PDF toolkit. It is capable of:

  • extracting document information (title, author, …),
  • splitting documents page by page,
  • merging documents page by page,
  • cropping pages,
  • merging multiple pages into a single page,
  • encrypting and decrypting PDF files.

Because it is pure Python, it should run on any Python platform without
depending on external libraries. It can also work entirely on
StringIO objects rather than file streams, allowing PDF
manipulation in memory. It is therefore a useful tool for websites
that manage or manipulate PDFs.
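
The in-memory splitting and merging described above can be sketched as follows. The original pyPdf targets Python 2 and StringIO; this sketch instead assumes its maintained successor, the `pypdf` package, together with `io.BytesIO` on Python 3. The helper names `split_pages` and `merge_documents` are invented for this illustration.

```python
import io


def split_pages(pdf_bytes):
    """Return a list of single-page PDFs (as bytes) without touching the disk."""
    # Assumption: the pypdf package is installed. It is imported here so that
    # the module still loads on machines where the package is missing.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(io.BytesIO(pdf_bytes))
    parts = []
    for page in reader.pages:
        writer = PdfWriter()
        writer.add_page(page)
        buffer = io.BytesIO()
        writer.write(buffer)
        parts.append(buffer.getvalue())
    return parts


def merge_documents(documents):
    """Concatenate several PDFs (each given as bytes) into one, page by page."""
    from pypdf import PdfReader, PdfWriter

    writer = PdfWriter()
    for data in documents:
        for page in PdfReader(io.BytesIO(data)).pages:
            writer.add_page(page)
    buffer = io.BytesIO()
    writer.write(buffer)
    return buffer.getvalue()
```

Because both helpers take and return plain bytes, a web application could hand them an uploaded file and send the result straight back in the response, with no temporary files involved.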

For details, please visit http://pybrary.net/pyPdf/
