
Pandas can't handle zipfile.Path objects (ValueError: Invalid file path or buffer object type: <class 'zipfile.Path'>) #49906

Open
buhtz opened this issue Nov 25, 2022 · 8 comments
Labels: Enhancement, IO Data (IO issues that don't fit into a more specific label)

Comments

@buhtz

buhtz commented Nov 25, 2022

This is reproducible with the latest pandas release, 1.5.2.

In Python, the zipfile.Path class is intended to behave similarly (but not identically!) to pathlib.Path. The latter is accepted by pandas, but the former is not.

Steps to reproduce:

  1. Create a zip file named foo.zip containing one CSV file named bar.csv.
  2. Create a path object pointing directly to that CSV file inside the zip file: zp = zipfile.Path('foo.zip', 'bar.csv')
  3. Pass that path object (zp) to pandas.read_csv().

Because of this part of your code:

pandas/pandas/io/common.py

Lines 446 to 452 in 3b09765

    # is_file_like requires (read | write) & __iter__ but __iter__ is only
    # needed for read_csv(engine=python)
    if not (
        hasattr(filepath_or_buffer, "read") or hasattr(filepath_or_buffer, "write")
    ):
        msg = f"Invalid file path or buffer object type: {type(filepath_or_buffer)}"
        raise ValueError(msg)

Python raises "ValueError: Invalid file path or buffer object type: <class 'zipfile.Path'>".
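The failure can be illustrated without pandas. The following is a minimal sketch, assuming Python 3.9 or later: `check_path_or_buffer` is a hypothetical standalone copy of only the quoted check (not the full pandas path-handling logic), and `reproduce` is a hypothetical helper that builds foo.zip with bar.csv in a temporary directory.

```python
import pathlib
import tempfile
import zipfile


def check_path_or_buffer(filepath_or_buffer):
    """Standalone copy of the quoted check only; not the full pandas logic."""
    if not (
        hasattr(filepath_or_buffer, "read") or hasattr(filepath_or_buffer, "write")
    ):
        msg = f"Invalid file path or buffer object type: {type(filepath_or_buffer)}"
        raise ValueError(msg)


def reproduce():
    """Build foo.zip containing bar.csv and run the check on a zipfile.Path."""
    with tempfile.TemporaryDirectory() as tmp:
        archive = pathlib.Path(tmp) / "foo.zip"
        with zipfile.ZipFile(archive, "w") as zf:
            zf.writestr("bar.csv", "a,b\n1,2\n")

        zp = zipfile.Path(archive, "bar.csv")
        try:
            check_path_or_buffer(zp)
        except ValueError as exc:
            return str(exc)
        finally:
            # Release the archive before the temporary directory is removed.
            zp.root.close()
    return None


print(reproduce())
```

The exact class name in the message may differ between Python versions, so the sketch only relies on the fact that a ValueError is raised.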

EDIT:
I'm aware that pandas.read_csv() offers the compression argument and can read compressed CSV files on its own. But this doesn't help in my case. I'm using pandas as a backend for a higher-level API that reads data files; pandas is just one part of it. One shortcoming of pandas here is that it cannot deal with ZIP files containing multiple CSV files.

pathlib.Path and zipfile.Path are both part of the Python standard library, and IMHO pandas should be able to deal with both.

@twoertwein
Member

Pandas works with all os.PathLike objects. zipfile.Path doesn't implement the os.PathLike protocol as it doesn't have __fspath__.

Since the Python documentation of zipfile.Path claims that it is compatible with pathlib.Path (which has __fspath__), I would recommend opening a bug report with CPython.

@buhtz
Author

buhtz commented Nov 27, 2022

Thanks for your analysis.
Where does pandas need __fspath__?

@twoertwein
Member

Where does pandas need __fspath__?

Here

if isinstance(filepath_or_buffer, os.PathLike):
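The difference between the two path types can be checked with the standard library alone. This is a minimal sketch, assuming Python 3.9 or later; `describe` and `compare_path_types` are hypothetical helper names used only for illustration.

```python
import os
import pathlib
import tempfile
import zipfile


def describe(obj):
    """Report the properties that the checks discussed in this thread look at."""
    return {
        "is_pathlike": isinstance(obj, os.PathLike),
        "has_fspath": hasattr(obj, "__fspath__"),
        "has_read_or_write": hasattr(obj, "read") or hasattr(obj, "write"),
    }


def compare_path_types():
    """Return the descriptions of a pathlib.Path and a zipfile.Path."""
    with tempfile.TemporaryDirectory() as tmp:
        archive = pathlib.Path(tmp) / "foo.zip"
        with zipfile.ZipFile(archive, "w") as zf:
            zf.writestr("bar.csv", "a,b\n1,2\n")

        zp = zipfile.Path(archive, "bar.csv")
        try:
            return describe(archive), describe(zp)
        finally:
            # Release the archive before the temporary directory is removed.
            zp.root.close()


pathlib_info, zipfile_info = compare_path_types()
print("pathlib.Path:", pathlib_info)
print("zipfile.Path:", zipfile_info)
```

Only the pathlib.Path passes the os.PathLike test; neither object has read or write, which is why the zipfile.Path falls through to the ValueError.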

@twoertwein
Member

I'm not familiar enough with what zipfile.Path does (or why it even exists - why not use pathlib) - it might be okay to have an elif branch to check for zipfile.Path and then convert it to a str?

@buhtz
Author

buhtz commented Nov 27, 2022

I'm not familiar enough with what zipfile.Path does (or why it even exists - why not use pathlib)

That is a good question. Let me explain.
The point is that you can "access" files inside a zip archive without explicitly opening and reading that zip file.

Let's say you have a zip-file containing three "data" files (2x csv, 1x excel). I would like to read them like this:

import zipfile
import pandas

fpa = zipfile.Path('data.zip', 'entryA.csv')
fpb = zipfile.Path('data.zip', 'entryB.csv')
fpc = zipfile.Path('data.zip', 'entryC.xlsx')

# that is possible
with fpa.open('r') as handle:
    csv_content = handle.read()

# but pandas isn't able to do it like this
df = pandas.read_csv(fpa)

I have a kind of pandas wrapper (buhtzology) for my own daily work. In buhtzology.bandas.read_and_validate_excel() you can see how I work around that shortcoming. Have a closer look at _file_path_or_zip_path_as_buffer(), where I decide whether pandas.read_excel() receives a pathlib.Path instance or an I/O object obtained via zipfile.Path.open().
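The general shape of such a workaround might look like the sketch below. It is not the buhtzology code: `path_or_buffer` and `demo` are hypothetical names, binary mode is an assumption (it suits Excel files), and Python 3.9 or later is assumed for zipfile.Path.open('rb').

```python
import contextlib
import pathlib
import tempfile
import zipfile


@contextlib.contextmanager
def path_or_buffer(source, mode="rb"):
    """Hypothetical helper: yield an open handle for zip entries, else the path."""
    if isinstance(source, zipfile.Path):
        # Entries inside an archive must be opened through zipfile.Path.open().
        with source.open(mode) as handle:
            yield handle
    else:
        # Regular paths are handed over unchanged.
        yield source


def demo():
    """Exercise the helper with a zip entry and with a regular path."""
    with tempfile.TemporaryDirectory() as tmp:
        archive = pathlib.Path(tmp) / "data.zip"
        with zipfile.ZipFile(archive, "w") as zf:
            zf.writestr("entryA.csv", "a,b\n1,2\n")

        entry = zipfile.Path(archive, "entryA.csv")
        try:
            with path_or_buffer(entry) as target:
                from_zip = target.read()
        finally:
            entry.root.close()

        with path_or_buffer(archive) as target:
            unchanged = target is archive
    return from_zip, unchanged


print(demo())
```

Inside the with block, the yielded object would be what gets passed on to pandas.read_csv() or pandas.read_excel().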

@twoertwein
Member

Thank you for describing the use case! In that case, it doesn't make sense to convert it to a str (and adding __fspath__ probably makes no sense either), as the file inside the zip file cannot be opened using open().

The reason why df = pandas.read_csv(fpa) doesn't work is that fpa doesn't have read/write. This should work:

with fpa.open('r') as handle:
    df = pandas.read_csv(handle)  # might need to specify mode="rb"

@buhtz
Author

buhtz commented Nov 27, 2022

The reason why df = pandas.read_csv(fpa) doesn't work is that fpa doesn't have read/write. This should work:

with fpa.open('r') as handle:
    df = pandas.read_csv(handle)  # might need to specify mode="rb"

That is exactly what I'm doing here in my workaround.

But other file-reading libraries that I've tested don't have problems with zipfile.Path and don't need such a workaround. E.g. Python's csv.reader() can handle such path objects well, as you can see here.

@twoertwein
Member

I think an option to accommodate that would be:

  1. first handle fsspec or urllib
  2. convert remaining strings to pathlib.Path
  3. call .open(mode=..., encoding=..., errors=..., newline=...) on objects that do not have read/write - works for both Paths
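Steps 2 and 3 of this proposal could be sketched roughly as follows. This is only an illustration under assumptions, not pandas code: `open_if_needed`, `read_rows` and `demo` are hypothetical names, step 1 (fsspec/urllib) is left out, the default argument values are arbitrary choices, and Python 3.9 or later is assumed so that zipfile.Path.open() accepts the text-mode keyword arguments.

```python
import csv
import pathlib
import tempfile
import zipfile


def open_if_needed(source, mode="r", encoding="utf-8", errors="strict", newline=""):
    """Hypothetical helper, not pandas API: return an object with read/write."""
    # Step 1 (fsspec / urllib handling) is omitted from this sketch.
    # Step 2: turn remaining strings into pathlib.Path objects.
    if isinstance(source, str):
        source = pathlib.Path(source)
    # Objects that already look like buffers are passed through untouched.
    if hasattr(source, "read") or hasattr(source, "write"):
        return source
    # Step 3: both pathlib.Path and zipfile.Path provide .open().
    if "b" in mode:
        # Text-only arguments are rejected in binary mode.
        return source.open(mode=mode)
    return source.open(mode=mode, encoding=encoding, errors=errors, newline=newline)


def read_rows(source):
    """Read CSV rows through open_if_needed() using the stdlib csv module."""
    handle = open_if_needed(source)
    try:
        return list(csv.reader(handle))
    finally:
        handle.close()


def demo():
    """Read the same CSV content from a str, a pathlib.Path and a zipfile.Path."""
    with tempfile.TemporaryDirectory() as tmp:
        plain = pathlib.Path(tmp) / "bar.csv"
        plain.write_bytes(b"a,b\n1,2\n")
        archive = pathlib.Path(tmp) / "foo.zip"
        with zipfile.ZipFile(archive, "w") as zf:
            zf.writestr("bar.csv", "a,b\n1,2\n")

        zp = zipfile.Path(archive, "bar.csv")
        try:
            return read_rows(str(plain)), read_rows(plain), read_rows(zp)
        finally:
            zp.root.close()


print(demo())
```

The point of the sketch is that the same .open() call serves both path types, so no special case for zipfile.Path would be needed.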

@twoertwein added the IO Data and Enhancement labels Nov 27, 2022