Skip to content

Bugfix: pop checkpoint resume from kwargs in experiments #4913

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

sjawhar
Copy link
Contributor

@sjawhar sjawhar commented Nov 19, 2020

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

Purpose

I believe #4855 introduced a regression to do with experiments, checkpoints, and run cache. #4911 seems to have tried to address it, but I still get the same error:

2020-11-19 19:10:05,223 ERROR: Failed to reproduce experiment 'a32ea21' - failed to reproduce 'dvc.yaml': restore() got an unexpected keyword argument 'checkpoint_resume'
------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/dvc/repo/reproduce.py", line 175, in _reproduce_stages
    ret = _reproduce_stage(stage, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/dvc/repo/reproduce.py", line 36, in _reproduce_stage
    stage = stage.reproduce(**kwargs)
  File "/usr/local/lib/python3.8/dist-packages/funcy/decorators.py", line 39, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/usr/local/lib/python3.8/dist-packages/dvc/stage/decorators.py", line 36, in rwlocked
    return call()
  File "/usr/local/lib/python3.8/dist-packages/funcy/decorators.py", line 60, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/dvc/stage/__init__.py", line 339, in reproduce
    self.run(**kwargs)
  File "/usr/local/lib/python3.8/dist-packages/funcy/decorators.py", line 39, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/usr/local/lib/python3.8/dist-packages/dvc/stage/decorators.py", line 36, in rwlocked
    return call()
  File "/usr/local/lib/python3.8/dist-packages/funcy/decorators.py", line 60, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/dvc/stage/__init__.py", line 465, in run
    run_stage(self, dry, force, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/dvc/stage/run.py", line 106, in run_stage
    stage.repo.stage_cache.restore(stage, **kwargs)
TypeError: restore() got an unexpected keyword argument 'checkpoint_resume'

Approach

After digging through the code, I found a spot where it seems like the checkpoint_resume parameter should be removed from the set of kwargs before continuing. I don't see checkpoint_resume being used anywhere outside of dvc.repo.experiments.new and dvc.repo.experiments._resume_checkpoint(), and the former calls the latter and is the only function to do so, so I think this fix is correct.

I spent a lot of time trying to write a test that captured this bug, but I failed. 😒 When I compare what's happening in the stack trace between my repo and the test case, I find a difference here:

dvc/dvc/repo/reproduce.py

Lines 169 to 172 in 4cf2f81

if checkpoint_func:
kwargs["checkpoint_func"] = partial(
_repro_callback, checkpoint_func, unchanged
)

In my repo kwargs["checkpoint_func"] gets set to None, whereas it's a function in the test case. Then when we get to dvc.stage.run.run_stage(), we hit this condition:

dvc/dvc/stage/run.py

Lines 101 to 109 in 4cf2f81

def run_stage(stage, dry=False, force=False, checkpoint_func=None, **kwargs):
if not (dry or force or checkpoint_func):
from .cache import RunCacheNotFoundError
try:
stage.repo.stage_cache.restore(stage, **kwargs)
return
except RunCacheNotFoundError:
pass

Which then causes the error.

@efiop efiop requested a review from pmrowla November 19, 2020 22:14
@sjawhar sjawhar force-pushed the hotfix/experiment-checkpoint-resume branch from 526830a to dab32ea Compare November 19, 2020 23:46
@pmrowla pmrowla added bugfix fixes bug skip-changelog Skips changelog labels Nov 20, 2020
@pmrowla pmrowla merged commit 153c374 into iterative:master Nov 20, 2020
@pmrowla
Copy link
Contributor

pmrowla commented Nov 20, 2020

Thanks for fixing this! πŸ™

@sjawhar sjawhar deleted the hotfix/experiment-checkpoint-resume branch November 20, 2020 01:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix fixes bug skip-changelog Skips changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants