Add read_pdf to IO Tools #4556

cancan101 · 2013-08-13T18:14:00Z

related #3281

Create a read_pdf method in IO tools for reading tables from PDF documents. Many data sets are released in PDF form.

For example:

Walmart's Historical Unit Count and Square Footage: http://az204679.vo.msecnd.net/media/documents/unit-counts-q1-fy14_130131488115936836.pdf
Las Vegas Visitor Statistics: http://www.lvcva.com/includes/content/images/media/docs/ES-YTD20128.pdf
Coffee export data: http://www.ico.org/historical/2010-19/PDF/EXPCALY.pdf

There are a number of standalone tools and projects for this:

http://ieg.ifs.tuwien.ac.at/projects/pdf2table/ (an academic project /paper; Java)
https://pypi.python.org/pypi/pdfquery (Python lib)
https://pypi.python.org/pypi/pdftable (Python lib)
http://www.unixuser.org/~euske/python/pdfminer/ (Python lib)
https://github.com/ashima/pdf-table-extract (Python lib)

There are also a number of sites and projects that convert PDF to HTML:

https://github.com/coolwanglu/pdf2htmlEX/wiki (open source)

cpcloud · 2013-08-13T18:49:09Z

lightning fast (and slightly superficial) review of the above tools:

~~http://ieg.ifs.tuwien.ac.at/projects/pdf2table/~~ (web page hasn't been updated since 2005, also in java, but maybe worth a quick glance)
https://github.com/jcushman/pdfquery looks interesting (very general lib for reading pdfs, might be overkill for just reading tables, definitely worth more investigation)
~~https://pypi.python.org/pypi/pdftable~~ (looks like it may have been abandoned)
http://www.unixuser.org/~euske/python/pdfminer/ (maybe, i can't tell with this one)
https://github.com/ashima/pdf-table-extract (worth further investigation, might be adaptable to pandas very easily since it uses numpy)
https://github.com/coolwanglu/pdf2htmlEX/wiki (not sure, worth a deeper look)

cancan101 · 2013-08-13T18:56:55Z

I should add that the list is not comprehensive, especially the list of projects to convert PDFs to HTML (or text or CSV).

cancan101 · 2013-08-13T19:24:53Z

Here are some additional sources:

ghost · 2014-01-24T15:12:14Z

As in several recent FRs, I'll reiterate that pandas accepts data. We do not need to swallow
up every possible dependency for data sources and reify it, partially, badly into our exposed API.
The maintainer of ashima/pdf-table-extract seems to have come to the same conclusion.

We want pandas to make it easy to work with any reasonable data source; that does not
require that we provide a specialized API for it. "data, the ultimate data API" (hello clojure!)

Where we can provide a significantly better workflow or solve a major pain point, it's fine to add on
to pandas. In all cases where it's trivial to compose an existing external library with pandas
to get work done, users should just do that.

Please, no more of this "library X can read Y, you should have X as a dep and add read_Y to the API",
unless there's a tangible added benefit that comes out of it.
The user is ultimately much more empowered by composing focused tools than by having
the LCM of the combination, and the devs are better off not fighting yet another vector for
method signature feature creep. #6029

just "use X".
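
A minimal sketch of what "just use X" could look like in practice, under stated assumptions: some external PDF tool (any of those listed earlier in the thread) has already produced the table as a list of rows of strings, first row being the header. The names `rows_to_csv`, `rows_to_frame`, and `extracted_rows` are hypothetical, and the cell values are illustrative placeholders, not data from any of the linked PDFs. Only the standard `csv`/`io` modules and `pandas.read_csv` are used.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize extracted table rows (a list of lists of strings) to CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def rows_to_frame(rows):
    """Load the rows into a DataFrame; the first row is treated as the header."""
    import pandas as pd  # imported lazily so the CSV step works without pandas

    return pd.read_csv(io.StringIO(rows_to_csv(rows)))


# Assumption: placeholder output of an external PDF table extractor,
# a header row followed by data rows, every cell a string.
extracted_rows = [
    ["year", "units"],
    ["2011", "100"],
    ["2012", "120"],
]

csv_text = rows_to_csv(extracted_rows)
```

With pandas installed, `rows_to_frame(extracted_rows)` hands the same rows to `read_csv`, so type inference and header handling come from the existing reader rather than from a new `read_pdf` entry point.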

cancan101 mentioned this issue Aug 28, 2013

ENH: Add decimal parameter to to_numeric #4674

Open

This was referenced Sep 24, 2013

Consider merging with Pandas ashima/pdf-table-extract#6

Closed

Support pdf tables -> datframe #3281

Closed

ghost closed this as completed Jan 24, 2014

This issue was closed.
