Add read_pdf to IO Tools #4556

Closed
cancan101 opened this issue Aug 13, 2013 · 4 comments
Labels
API Design Enhancement IO Data IO issues that don't fit into a more specific label
Comments

@cancan101
Contributor

related #3281

Create a read_pdf method in IO tools for reading tables from PDF documents. Many data sets are released in PDF form.

For example:

There are a number of standalone tools and projects for this:

There are also a number of sites and projects that convert PDF to HTML:

@cpcloud
Member

cpcloud commented Aug 13, 2013

lightning fast (and slightly superficial) review of the above tools:

@cancan101
Contributor Author

I should add that the list above is not comprehensive, especially the list of projects that convert PDFs to HTML (or text or CSV).

@ghost

ghost commented Jan 24, 2014

As in several recent FRs, I'll reiterate that pandas accepts data. We do not need to swallow
up every possible dependency for data sources and reify it, partially, badly into our exposed API.
The maintainer of ashima/pdf-table-extract seems to have come to the same conclusion.

We want pandas to make it easy to work with any reasonable data source, that does not
require we provide a specialized API for it. "data, the ultimate data API" (hello clojure!)

Where we can provide significantly better workflow or solve a major pain point it's fine to add on
to pandas. In all cases where it's trivial to compose an existing external library with pandas
to get work done, users should just do that.

Please, no more of this "library X can read Y, you should have X as a dep and add read_Y to the API",
unless there's a tangible added benefit that comes out of it.
The user is ultimately much more empowered by composing focused tools than by having
the LCM of the combination, and the devs are better off not fighting yet another vector for
method signature feature creep. #6029

just "use X".
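
As a rough illustration of what "just use X" could look like in practice, here is a minimal sketch of composing an external extractor with pandas. The names `extract_table_rows`, `rows_to_frame`, and `report.pdf` are hypothetical and not part of pandas or any real library; the extractor is a stub standing in for whatever PDF tool is used, and it returns fixed rows so the sketch runs without a PDF dependency.

```python
import pandas as pd


def extract_table_rows(pdf_path):
    """Hypothetical stand-in for an external PDF table extractor ("X").

    A real tool would parse pdf_path; this stub ignores it and returns
    fixed rows so the sketch runs without any PDF library installed.
    """
    return [
        ["year", "region", "total"],
        ["2012", "north", "10"],
        ["2012", "south", "12"],
        ["2013", "north", "15"],
    ]


def rows_to_frame(rows):
    """Treat the first row as the header and the remaining rows as data."""
    header, *body = rows
    frame = pd.DataFrame(body, columns=header)
    # Extracted cells arrive as strings; convert the numeric columns explicitly.
    frame["year"] = frame["year"].astype(int)
    frame["total"] = frame["total"].astype(int)
    return frame


# Compose the two steps: the external tool yields rows, pandas takes over from there.
df = rows_to_frame(extract_table_rows("report.pdf"))
totals_by_year = df.groupby("year")["total"].sum()
```

Once the rows are in a DataFrame, everything downstream is ordinary pandas, which is the point of the argument: the glue is a few lines on the user's side and needs no dedicated `read_pdf` entry point.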

@ghost ghost closed this as completed Jan 24, 2014