Skip to content

ENH: .read_pickle(...) from zip containing hidden OS X/macOS metadata files/folders #37101

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 17 commits into from

Conversation

ml-evs
Copy link

@ml-evs ml-evs commented Oct 13, 2020

This PR allows for .zip files created by OS X and macOS that contain __MACOSX and .DS_STORE metadata folders to be loaded by read_pickle/read_csv/read_table/read_json without error.

It does not work with pd.read_csv(...) as in that case the compression is handled in the C code. If this enhancement is deemed desirable then I'm willing to have a go at writing it (and the test is already written).

(No longer true as of #36997)

Other similar folders may exist from other operating systems, in which case the list could be pulled out as a constant which could be used in the tests and in the module itself.

@pep8speaks
Copy link

pep8speaks commented Oct 13, 2020

Hello @ml-evs! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-11-19 15:01:58 UTC

@ml-evs ml-evs force-pushed the ml-evs/fix_#37098 branch 2 times, most recently from 46fce7c to bd7a78a Compare October 13, 2020 14:57
@ml-evs ml-evs force-pushed the ml-evs/fix_#37098 branch from bd7a78a to d89aafb Compare October 13, 2020 15:04
@ml-evs ml-evs force-pushed the ml-evs/fix_#37098 branch from d89aafb to 73ac7ca Compare October 13, 2020 17:26
@twoertwein
Copy link
Member

I cannot comment on whether this PR is 'desirable' :)

I think it it should also work for pd.read_csv(..., engine="python"), or? And if I manage to get #36997 working, it will also work with the c engine.

@ml-evs
Copy link
Author

ml-evs commented Oct 14, 2020

I cannot comment on whether this PR is 'desirable' :)

I wouldn't want to argue particularly hard for it myself! I guess it depends how prevalent this "problem" is, but I won't be too upset if this doesn't make it in 🙃

I think it it should also work for pd.read_csv(..., engine="python"), or? And if I manage to get #36997 working, it will also work with the c engine.

It does indeed work with the Python engine, the tests in tests/io/parsers/test_compression.py are run on every engine, I'm happy to push the failing tests if that would useful. Can also rebase onto your PR and check if that works too.

I'll hold off until there's been a discussion about whether this is desirable (either here or in the issue #37098).

@twoertwein
Copy link
Member

It does indeed work with the Python engine, the tests in tests/io/parsers/test_compression.py are run on every engine, I'm happy to push the failing tests if that would useful. Can also rebase onto your PR and check if that works too.

I'll hold off until there's been a discussion about whether this is desirable (either here or in the issue #37098).

I wouldn't wait for my PR (I'm not sure how long it will take me to debug two failing tests on windows). I will ping you when/if the PR gets merged. Including test for the python engine would be good.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this only an issue on read? Not write?

) as p2, tm.ensure_clean(dummy) as d:
df = tm.makeDataFrame()

# write to uncompressed file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove all of these comments

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really? I'm just following the style of other tests in this file. I've made the comments more succinct but wouldn't want to remove them altogether as this is a bit of a weird edge-case to test.

@ml-evs
Copy link
Author

ml-evs commented Oct 25, 2020

Is this only an issue on read? Not write?

I think it's only a problem when the zip has been made inside the macOS file manager, rather than say, at the command-line (though I do not have a Mac to test this).

As I said before, maybe this is a non-issue.

@twoertwein
Copy link
Member

@ml-evs #36997 is merged: you PR will also work for read_csv (both c engine and python engine).

@ml-evs
Copy link
Author

ml-evs commented Nov 4, 2020

@ml-evs #36997 is merged: you PR will also work for read_csv (both c engine and python engine).

This seems to be the case, thanks @twoertwein!

@ml-evs
Copy link
Author

ml-evs commented Nov 19, 2020

Happy to close this if there's no interest; will re-ping the initial reviewers just in case.

@ml-evs ml-evs changed the title ENH: load_pickle(...) from zip containing hidden OS X/macOS metadata files/folders ENH: .read_pickle(...) from zip containing hidden OS X/macOS metadata files/folders Nov 19, 2020
@jbrockmendel
Copy link
Member

ive tried, but have been unable to form an opinion

@twoertwein
Copy link
Member

twoertwein commented Nov 19, 2020

a more generic approach might be to use {'method': 'zip', 'file': 'path/within/zip'} to let the user choose which file in a ZIP file should be used? This is more work for the user but allows to read different files within a ZIP file.

Technically this can already be achieved with

from zipfile import ZipFile

import pandas as pd

with ZipFile("test.zip", mode="r") as file:
    with file.open("data.pickle", mode="r") as file:
        pd.read_pickle(file)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants