Archive for October 19th, 2009

Three Open source PDF Parser developed in Python

PDFMiner is a suite of programs that help extracting and analyzing text data of PDF documents. Unlike other PDF-related tools, it allows to obtain the exact location of texts in a page, as well as other extra information such as font information or ruled lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes instead of text analysis.
for details, please visit Extract and Analyze Text Data of PDF Documents with PDFMiner

pdf-parser.py
This tool will parse a PDF document to identify the fundamental elements used in the analyzed file. It will not render a PDF document. The code of the parser is quick-and-dirty, I’m not recommending this as text book case for PDF parsers, but it gets the job done.
for details, please visit, PDF Tools

PyPDF

A Pure-Python library built as a PDF toolkit. It is capable of:

  • extracting document information (title, author, …),
  • splitting documents page by page,
  • merging documents page by page,
  • cropping pages,
  • merging multiple pages into a single page,
  • encrypting and decrypting PDF files.

By being Pure-Python, it should run on any Python platform without
any dependencies on external libraries. It can also work entirely on
StringIO objects rather than file streams, allowing for PDF
manipulation in memory. It is therefore a useful tool for websites
that manage or manipulate PDFs.

for details, please visit http://pybrary.net/pyPdf/

Share and Enjoy:
  • Digg
  • del.icio.us
  • Netvouz
  • DZone
  • ThisNext
  • MisterWong
  • Wists
  • BlinkList
  • blogmarks
  • blogtercimlap
  • connotea
  • DotNetKicks
  • Fark
  • Fleck
  • Gwar
  • Haohao
  • IndianPad
  • Internetmedia
  • LinkaGoGo
  • MyShare
  • Netscape
  • NewsVine
  • Rec6
  • Reddit
  • Scoopeo
  • Slashdot
  • StumbleUpon
  • Technorati
  • Webride

Extract and Analyze Text Data of PDF Documents with PDFMiner

PDFMiner is a suite of programs that help extracting and analyzing text data of PDF documents. Unlike other PDF-related tools, it allows to obtain the exact location of texts in a page, as well as other extra information such as font information or ruled lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes instead of text analysis.
The features of PDFMiner are as follow,

  • Written entirely in Python. (for version 2.4 or newer)
  • PDF-1.7 specification support. (well, almost)
  • Non-ASCII languages and vertical writing scripts support.
  • Various font types (Type1, TrueType, Type3, and CID) support.
  • Basic encryption (RC4) support.
  • PDF to HTML conversion (with a sample converter web app).
  • Outline (TOC) extraction.
  • Tagged contents extraction.
  • Infer text running by using clustering technique.

PDFMiner comes with two handy tools: pdf2txt.py and dumppdf.py,
pdf2txt.py extracts text contents from a PDF file. It extracts all the texts that are to be rendered programmatically, It cannot recognize texts drawn as images that would require optical character recognition. It also extracts the corresponding locations, font names, font sizes, writing direction (horizontal or vertical) for each text portion. You need to provide a password for protected PDF documents when its access is restricted. You cannot extract any text from a PDF document which does not have extraction permission.

For non-ASCII languages, you can specify the output encoding (such as UTF-8).
dumppdf.py dumps the internal contents of a PDF file in pseudo-XML format. This program is primarily for debugging purpose, but it’s also possible to extract some meaningful contents (such as images).
To download the last version PDFminer or for details, please visit http://www.unixuser.org/~euske/python/pdfminer/index.html

Share and Enjoy:
  • Digg
  • del.icio.us
  • Netvouz
  • DZone
  • ThisNext
  • MisterWong
  • Wists
  • BlinkList
  • blogmarks
  • blogtercimlap
  • connotea
  • DotNetKicks
  • Fark
  • Fleck
  • Gwar
  • Haohao
  • IndianPad
  • Internetmedia
  • LinkaGoGo
  • MyShare
  • Netscape
  • NewsVine
  • Rec6
  • Reddit
  • Scoopeo
  • Slashdot
  • StumbleUpon
  • Technorati
  • Webride

Disitool is a Python Program to Manipulate Embedded Digital Signatures

Executables (PE files) can have a digital signature, Microsoft calls this signature AuthentiCode. There are 2 different ways to sign a PE file: by adding a digital signature to the PE file (embedded digital signature) or by adding a hash of the PE file to a security catalog file (filetype .CAT).

Disitool is a small Python program to manipulate embedded digital signatures

  • delete a signature: disitool.py delete signed-file unsigned-file
  • copy a signature: disitool.py copy signed-source-file unsigned-file signed-file
  • extract a signature: disitool.py extract signed-file signature
  • add a signature: disitool.py add signature unsigned-file signed-file
  • inject data after the authenticode signature: disitool.py inject [--paddata] signed-source-file data-file signed-destination-file

It is not a tool to digitally sign executables, use signtool for this. When you add or copy a signature from one file to another file, the signature will not be valid.

disitool uses pefile, you’ll need to install this first. This new version (V0.2) will update the PE header checksum.

Download:

disitool_v0_3.zip (https)

MD5: 08D1CA036DC905D8E42AB3016A1B7821

SHA256: AEF923F49E53C7C2194058F34A73B293D21448DEB7E2112819FC1B3B450347B8

–Form Disitool

Share and Enjoy:
  • Digg
  • del.icio.us
  • Netvouz
  • DZone
  • ThisNext
  • MisterWong
  • Wists
  • BlinkList
  • blogmarks
  • blogtercimlap
  • connotea
  • DotNetKicks
  • Fark
  • Fleck
  • Gwar
  • Haohao
  • IndianPad
  • Internetmedia
  • LinkaGoGo
  • MyShare
  • Netscape
  • NewsVine
  • Rec6
  • Reddit
  • Scoopeo
  • Slashdot
  • StumbleUpon
  • Technorati
  • Webride