Skip to content

Add backend switch for IterDataPipe.__getstate__ #341

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
ejguan opened this issue Apr 1, 2022 · 5 comments
Open

Add backend switch for IterDataPipe.__getstate__ #341

ejguan opened this issue Apr 1, 2022 · 5 comments

Comments

@ejguan
Copy link
Contributor

ejguan commented Apr 1, 2022

I see. I can confirm that we will rely on this DILL_AVAILABLE for any place in TorchData project to determine if dill is available or not.

The problem is that we will need to patch every single module where this is imported. You cannot patch the place where it is defined, but rather where it is used. If you look above, we are not patching ._utils.serialization but rather .datapipes.datapipe, because this is where the flag is used. If we now need use this flag in multiple modules, we need to patch all of them. This is very brittle.

It's doable, but I am not sure if we want to do so because the goal of automatically using dill is to reduce the work users need to figure out if the DataPipe is serializable with lambda function.

Not sure I understand. If we just keep the same detection as we have now, users that don't care should not see any difference. If dill is available, it will be picked up and otherwise pickle will be used. But it would give users the option to enforce a particular backend if they need to. Without this option, the environment you use has an effect on the functionality and there is no way change that. I don't think this is good design.

Even if you don't do it for the users, think about how you want to test pickle vs dill yourself. Right now the only option is to have two separate workflows one with dill installed and one without.

Originally posted by @pmeier in pytorch/vision#5711 (comment)

@NivekT
Copy link
Contributor

NivekT commented May 19, 2022

Do we still want a backend switch? I think we have revamped the usage of dill in IterDataPipe by moving the attempts to use dill to traversal, DataLoader (V1). Changes to DLv2 is still a TODO.

pytorch/pytorch#74984

@ejguan
Copy link
Contributor Author

ejguan commented May 19, 2022

I think with SerializationWrapper pytorch/pytorch#77288, the similar functionality can be extended to DataLoader2 easily.

@pmeier
Copy link
Contributor

pmeier commented May 20, 2022

Do we still want a backend switch?

Unless torchdata gives a guarantee that the behavior is exactly the same whether dill is installed or not, IMO we do. Of course that doesn't mean that more functionality (lambda's, local functions, ...) can be supported if dill is available. It just means if a workflow works without dill is must also work with dill.

The reason we (torchvision) want this switch is that we need to make sure our datasets work in "arbitrary" environments. Unless we have the guarantee from above or a switch this means we need to run the whole test suite twice just to make sure having dill in the environment doesn't make a difference.

cc @NicolasHug

@NicolasHug
Copy link
Member

Unless torchdata gives a guarantee that the behavior is exactly the same whether dill is installed or not

It was my understanding from past meetings that this would be the case. Perhaps one of you can confirm?

In which case, as Philip already mentioned, we won't need the switch (I might still keep pytorch/vision#5938 open for a bit though, to help me sleep at night :) )

@NivekT
Copy link
Contributor

NivekT commented May 20, 2022

I agree that the goal is still have the same behavior whether dill is installed or not and this landed PR should make sure of that for DataLoader V1. I am looking into the same for DataLoader V2 in a separate PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants