fix HttpResource.resolve() with preprocessing #5669

Merged
merged 5 commits into from
Mar 23, 2022

Conversation

pmeier
Collaborator

@pmeier pmeier commented Mar 23, 2022

#5282 removed the preprocess flag from OnlineResource. In #5584 and #5667 it was discovered that this broke the case where either decompress or extract is set on an HttpResource and the URL redirects. @YosuaMichael provided a solution in #5667, but I think a more general solution would be much better.

This PR removes the decompress and extract parameters and re-adds preprocess. It can be any Callable[[pathlib.Path], pathlib.Path]. To keep the old convenience, one can also pass preprocess="decompress" or preprocess="extract".
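
As a rough illustration of how a parameter like this could be normalized, here is a minimal sketch. It is not the torchvision implementation: `normalize_preprocess`, `_decompress`, `_extract`, and `_SHORTCUTS` are hypothetical names, and the two helpers are stand-ins that only rewrite the path instead of touching any file. The invalid-string check is likewise just one possible way to guard against typos.

```python
import pathlib
from typing import Callable, Optional, Union

Preprocess = Callable[[pathlib.Path], pathlib.Path]


def _decompress(path: pathlib.Path) -> pathlib.Path:
    # Hypothetical stand-in: a real helper would decompress the file on disk.
    # Here we only drop the last suffix, e.g. "data.csv.gz" -> "data.csv".
    return path.with_suffix("")


def _extract(path: pathlib.Path) -> pathlib.Path:
    # Hypothetical stand-in: a real helper would unpack the archive.
    # Here we only drop the last suffix, e.g. "images.zip" -> "images".
    return path.with_suffix("")


# Assumed mapping from the convenience strings to callables.
_SHORTCUTS = {"decompress": _decompress, "extract": _extract}


def normalize_preprocess(
    preprocess: Optional[Union[str, Preprocess]],
) -> Optional[Preprocess]:
    """Turn None, a shortcut string, or a callable into an optional callable."""
    if preprocess is None or callable(preprocess):
        return preprocess
    if isinstance(preprocess, str):
        if preprocess not in _SHORTCUTS:
            raise ValueError(
                f"Unknown preprocess {preprocess!r}; "
                f"expected one of {sorted(_SHORTCUTS)} or a callable."
            )
        return _SHORTCUTS[preprocess]
    raise TypeError(f"Unsupported preprocess type: {type(preprocess).__name__}")
```

With this shape, the rest of the class only ever deals with an optional callable, regardless of whether the caller passed a string or a function.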

@@ -31,19 +32,16 @@ def __init__(
         *,
         file_name: str,
         sha256: Optional[str] = None,
-        decompress: bool = False,
-        extract: bool = False,
+        preprocess: Optional[Union[Literal["decompress", "extract"], Callable[[pathlib.Path], pathlib.Path]]] = None,
Collaborator Author

We could also go for an enum here, but not sure if that would be overkill.
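
For comparison, the enum variant could look roughly like the sketch below. `PreprocessKind` is a hypothetical name and not part of the PR; the point is only that looking up an enum member by value rejects unknown strings without extra code.

```python
import enum


class PreprocessKind(enum.Enum):
    # Hypothetical enum mirroring the two convenience strings.
    DECOMPRESS = "decompress"
    EXTRACT = "extract"


def parse_kind(value: str) -> PreprocessKind:
    # Enum lookup by value raises ValueError for anything that is not a member.
    return PreprocessKind(value)
```

The trade-off is an extra public type that callers have to import, versus plain strings that need an explicit validity check.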

@facebook-github-bot

facebook-github-bot commented Mar 23, 2022

💊 CI failures summary and remediations

As of commit 8f07a4e (more details on the Dr. CI page):


None of the CI failures appear to be your fault 💚



🚧 3 ongoing upstream failures:

These were probably caused by upstream breakages that are not fixed yet.


Member

@NicolasHug NicolasHug left a comment

Thanks for the PR @pmeier

If I remember correctly, the way redirections are done is fairly hacky. Would it be worth thinking of a robust way of doing these?

@pmeier
Collaborator Author

pmeier commented Mar 23, 2022

If I remember correctly, the way redirections are done is fairly hacky. Would it be worth thinking of a robust way of doing these?

Yes, but not now. The thing that doesn't work is redirection for mirrors, and we only have a single dataset with mirrors. In general, nothing has changed since the last time we evaluated this. I think we should wait for your investigation into how we want to load data from remote sources, e.g. whether we use iopath or something else. Once that decision is made, I'm all for refactoring this properly.

Member

@NicolasHug NicolasHug left a comment

Approving to unblock but if we're going to revisit everything anyway, perhaps there are more minimal changes that would work too (and that would be easier to review)? Not sure how tied the proposed changes are to the actual fix here.

@pmeier
Collaborator Author

pmeier commented Mar 23, 2022

more minimal changes that would work too (and that would be easier to review)?

IMO, the changes are fairly minimal. As stated in my top comment, have a look at #5667 for an alternative solution. The gist from there is that we internally track whether decompress or extract was set to be able to pass the right values when instantiating a new resource after resolving.
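
To make the difference concrete, here is a toy sketch of the forwarding idea, assuming a resource that keeps a single `preprocess` callable. `ToyHttpResource` and the `redirects` dictionary are hypothetical and stand in for the real class and for actual HTTP redirect handling; no network access happens here.

```python
import dataclasses
import pathlib
from typing import Callable, Dict, Optional


@dataclasses.dataclass(frozen=True)
class ToyHttpResource:
    # Hypothetical, simplified stand-in for an HTTP-backed resource.
    url: str
    file_name: str
    preprocess: Optional[Callable[[pathlib.Path], pathlib.Path]] = None

    def resolve(self, redirects: Dict[str, str]) -> "ToyHttpResource":
        # Follow at most one redirect, looked up in a plain dict
        # instead of issuing a real HTTP request.
        target = redirects.get(self.url)
        if target is None:
            return self
        # Copy every field except the URL, so `preprocess` is carried over
        # without having to track separate decompress/extract flags.
        return dataclasses.replace(self, url=target)
```

Because the new resource is built from the stored callable, nothing has to remember which of the two original flags was set.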

@pmeier pmeier merged commit 151e162 into pytorch:main Mar 23, 2022
@pmeier pmeier deleted the datasets/resolve-preprocess branch March 23, 2022 18:29
facebook-github-bot pushed a commit that referenced this pull request Apr 5, 2022
Summary:
* fix HttpResource.resolve() with preprocess set

* fix README

* add safe guard for invalid str inputs

(Note: this ignores all push blocking failures!)

Reviewed By: datumbox

Differential Revision: D35216797

fbshipit-source-id: c0c2fee98d5a7ade1b6870b11f396632539eb994
3 participants