Extract Table from PDF with Python

I once did a freelance job that involved extracting tables from PDFs with the help of pdftohtml (part of xpdf) and other PDF software such as Soda PDF. The workflow was to convert the PDF to XML first, then parse the XML and write out CSV.

Today I’d like to introduce two packages that make it easy to convert PDF tables to CSV or Excel: pdfplumber and tabula-py (which needs a JVM).

pdfplumber lets you plumb a PDF for detailed information about each text character, rectangle, and line. It also offers table extraction and visual debugging.
It works best on machine-generated, rather than scanned, PDFs. It is built on pdfminer and pdfminer.six.
tabula-py is a simple Python wrapper around tabula-java, which can read tables in a PDF. You can read tables from a PDF and convert them into pandas DataFrames. tabula-py also lets you convert a PDF file into a CSV/TSV/JSON file.

Let’s install them first.
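Both packages are published on PyPI, so pip install pdfplumber tabula-py should normally be enough, plus a Java runtime for tabula-py. The snippet below is only a small sanity check I would run afterwards, not part of either library; the helper name check_environment is made up for this post.

```python
import importlib.util
import shutil


def check_environment():
    """Report which of the tools used in this post appear to be available."""
    return {
        # find_spec returns None when a top-level module cannot be found
        "pdfplumber": importlib.util.find_spec("pdfplumber") is not None,
        # the tabula-py distribution is imported under the name "tabula"
        "tabula": importlib.util.find_spec("tabula") is not None,
        # tabula-py drives tabula-java, so a Java runtime has to be on PATH
        "java": shutil.which("java") is not None,
    }


for name, available in check_environment().items():
    print(f"{name}: {'found' if available else 'missing'}")
```

If any line prints missing, install that piece before going on.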

The test PDF files we will use today are
http://www.cntaiping.com/upload/cms/cntaiping/201809/14174039r1iv.pdf

and
http://202.107.205.11:8612/doc/浙高企认办〔2018〕3号关于浙江省2018年第一批拟更名高新技术企业公示名单.pdf
which I renamed to hangzhou.pdf.
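Starting with tabula-py, a minimal sketch of reading the first file could look like this. It assumes the PDF was saved locally as 14174039r1iv.pdf (the name is taken from the URL), and the helpers as_list and read_tables_with_tabula are hypothetical names for this post; only tabula.read_pdf comes from the library.

```python
from pathlib import Path


def as_list(result):
    """Normalise a result to a list (some tabula-py versions may return one DataFrame)."""
    return result if isinstance(result, list) else [result]


def read_tables_with_tabula(pdf_path, pages="all"):
    """Return the tables tabula-py detects in pdf_path as a list of DataFrames."""
    import tabula  # imported lazily: needs tabula-py and a Java runtime

    result = tabula.read_pdf(str(pdf_path), pages=pages, multiple_tables=True)
    return as_list(result)


pdf_path = Path("14174039r1iv.pdf")  # assumed local copy of the first test file
if pdf_path.exists():
    for index, frame in enumerate(read_tables_with_tabula(pdf_path)):
        print(f"table {index}: shape {frame.shape}")
        print(frame.head())
```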

The tabula-py result is bad, so let’s try pdfplumber.
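With pdfplumber, a rough sketch of the same step might be the following. It assumes the table of interest is on the first page of the local file 14174039r1iv.pdf; first_table and rows_to_csv are helper names invented here, while pdfplumber.open and page.extract_table are the library calls.

```python
import csv
import io
from pathlib import Path


def rows_to_csv(rows):
    """Turn rows of str/None cells (the shape extract_table returns) into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # pdfplumber uses None for empty cells; write them as empty strings
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


def first_table(pdf_path, page_index=0):
    """Extract the largest table pdfplumber finds on one page, or [] if none."""
    import pdfplumber  # imported lazily so the CSV helper works without it

    with pdfplumber.open(str(pdf_path)) as pdf:
        table = pdf.pages[page_index].extract_table()
    return table or []


pdf_path = Path("14174039r1iv.pdf")  # assumed local copy of the first test file
if pdf_path.exists():
    print(rows_to_csv(first_table(pdf_path)))
```

Each row comes back as a plain Python list, so the result can go straight into the csv module or a pandas DataFrame.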

That looks great. Let’s try the other PDF, hangzhou.pdf.
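For hangzhou.pdf, one way to try tabula-py is its direct PDF-to-CSV conversion. This is a sketch under the assumption that the file sits in the working directory; csv_path_for and convert_with_tabula are hypothetical helpers, and tabula.convert_into is the library function doing the work.

```python
from pathlib import Path


def csv_path_for(pdf_path):
    """Derive the output name by swapping the .pdf suffix for .csv."""
    return Path(pdf_path).with_suffix(".csv")


def convert_with_tabula(pdf_path, pages="all"):
    """Ask tabula-py to write every detected table in pdf_path to one CSV file."""
    import tabula  # imported lazily: needs tabula-py and a Java runtime

    out_path = csv_path_for(pdf_path)
    tabula.convert_into(str(pdf_path), str(out_path), output_format="csv", pages=pages)
    return out_path


pdf_path = Path("hangzhou.pdf")
if pdf_path.exists():
    out_path = convert_with_tabula(pdf_path)
    # Print only the beginning of the file to judge the quality by eye
    print(out_path.read_text(encoding="utf-8", errors="replace")[:500])
```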

The tabula-py result is bad again.
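A pdfplumber version for a multi-page file could look roughly like this. It assumes the table continues across pages and may repeat its header row on each page, which I have not verified for every PDF; merge_tables and extract_all_tables are names made up for this sketch.

```python
import csv
from pathlib import Path


def merge_tables(tables):
    """Concatenate tables, dropping a first row that just repeats the header."""
    merged = []
    header = None
    for table in tables:
        if not table:
            continue  # skip pages or regions where nothing was extracted
        if header is None:
            header = table[0]
            merged.extend(table)
        else:
            start = 1 if table[0] == header else 0
            merged.extend(table[start:])
    return merged


def extract_all_tables(pdf_path):
    """Collect every table pdfplumber finds, page by page, in reading order."""
    import pdfplumber  # imported lazily so merge_tables works without it

    tables = []
    with pdfplumber.open(str(pdf_path)) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables())
    return tables


pdf_path = Path("hangzhou.pdf")
if pdf_path.exists():
    rows = merge_tables(extract_all_tables(pdf_path))
    # utf-8-sig helps Excel recognise the Chinese text in the CSV
    with open("hangzhou.csv", "w", newline="", encoding="utf-8-sig") as handle:
        csv.writer(handle).writerows(
            ["" if cell is None else cell for cell in row] for row in rows
        )
    print(f"wrote {len(rows)} rows")
```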

With pdfplumber the result is perfect!

Based on this simple test, pdfplumber does much better than tabula-py, and it is written in pure Python, so it needs no JVM.
