Bugfix: pop checkpoint resume from kwargs in experiments #4913
I have followed the Contributing to DVC checklist.
If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible.
Purpose
I believe #4855 introduced a regression to do with experiments, checkpoints, and run cache. #4911 seems to have tried to address it, but I still get the same error:
Approach
After digging through the code, I found a spot where it seems like the `checkpoint_resume` parameter should be removed from the set of `kwargs` before continuing. I don't see `checkpoint_resume` being used anywhere outside of `dvc.repo.experiments.new` and `dvc.repo.experiments._resume_checkpoint()`, and the former calls the latter and is the only function to do so, so I think this fix is correct.

I spent a lot of time trying to write a test that captured this bug, but I failed. When I compare what's happening in the stack trace between my repo and the test case, I find a difference here:
dvc/dvc/repo/reproduce.py
Lines 169 to 172 in 4cf2f81
In my repo, `kwargs["checkpoint_func"]` gets set to `None`, whereas it's a function in the test case. Then when we get to `dvc.stage.run.run_stage()`, we hit this condition:
dvc/dvc/stage/run.py
Lines 101 to 109 in 4cf2f81
Which then causes the error.
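
To make the reasoning above easier to follow, here is a minimal, self-contained sketch of the failure mode as I understand it. Every function below is a hypothetical stand-in (`restore`, `run_stage`, `new_experiment_buggy`, `new_experiment_fixed`); none of them reproduce DVC's real signatures or the code at the referenced lines. The sketch only assumes that leftover `kwargs` get forwarded to a stricter callee when no checkpoint function is set, which would also explain why a test that supplies a checkpoint function never reaches the failing branch.

```python
# Hypothetical stand-ins only; these are NOT DVC's real functions or signatures.


def restore(stage, run_cache=True):
    """Strict callee: it does not accept arbitrary keyword arguments."""
    return f"restored {stage}"


def run_stage(stage, checkpoint_func=None, **kwargs):
    """Forward leftover kwargs to restore() only when no checkpoint function is set."""
    if checkpoint_func is None:
        return restore(stage, **kwargs)
    return f"ran {stage} with checkpoints"


def new_experiment_buggy(stage, **kwargs):
    # get() reads checkpoint_resume but leaves it in kwargs, so it leaks downstream.
    _resume = kwargs.get("checkpoint_resume")
    return run_stage(stage, **kwargs)


def new_experiment_fixed(stage, **kwargs):
    # pop() reads the value and removes the key, so it is never forwarded.
    _resume = kwargs.pop("checkpoint_resume", None)
    return run_stage(stage, **kwargs)
```

In this sketch, `new_experiment_buggy("train", checkpoint_resume="abc")` raises a `TypeError` because `restore()` receives an unexpected keyword argument, while the same call with a `checkpoint_func` supplied skips that branch and succeeds. Popping the key in `new_experiment_fixed` avoids the error in both cases.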