Skip to content

checkpoints: exp run and exp res[ume] refactor #4855

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Nov 11, 2020

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Nov 6, 2020

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

Related to #4821.

  • checkpoint stage detection is handled automatically based on whether or not a stage contains checkpoint outputs
  • dvc exp run --checkpoint is replaced with dvc exp run
    • To force reset an existing checkpoint run from scratch, use dvc exp run -f.
  • dvc exp run --continue is replaced with dvc exp res[ume]

@pmrowla pmrowla added the A: experiments Related to dvc exp label Nov 6, 2020
@pmrowla pmrowla self-assigned this Nov 6, 2020
@pmrowla pmrowla force-pushed the exp-resume branch 2 times, most recently from 5ce2482 to 414d513 Compare November 9, 2020 06:55
* since executor has to move to threadpool execution anyways, we no
  longer have to work around the pickle repro result limitation
* regular outputs may not be written at the time a checkpoint is
  generated, but subsequent checkouts of that stage should not fail due
  to the missing outs
* refactor signaling (use threading.Event instead of Condition)
* fix windows bug (kill all child processes via taskkill)
Comment on lines -157 to -183
import networkx as nx

if single_item:
all_pipelines = stages
else:
all_pipelines = []
for stage in stages:
if downstream:
# NOTE (py3 only):
# Python's `deepcopy` defaults to pickle/unpickle the object.
# Stages are complex objects (with references to `repo`,
# `outs`, and `deps`) that cause struggles when you try
# to serialize them. We need to create a copy of the graph
# itself, and then reverse it, instead of using
# graph.reverse() directly because it calls `deepcopy`
# underneath -- unless copy=False is specified.
nodes = nx.dfs_postorder_nodes(
G.copy().reverse(copy=False), stage
)
all_pipelines += reversed(list(nodes))
else:
all_pipelines += nx.dfs_postorder_nodes(G, stage)

pipeline = []
for stage in all_pipelines:
if stage not in pipeline:
pipeline.append(stage)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is just moved into _get_pipeline() to fix flake8 complexity error

def _kill_nt(proc):
# windows stages are spawned with shell=True, proc is the shell process and
# not the actual stage process - we have to kill the entire tree
subprocess.call(["taskkill", "/F", "/T", "/PID", str(proc.pid)])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something that we could consider doing via psutil, which has the logic needed to actually walk Windows process trees and find each child process individually (native python does not provide an easy way to do that). psutil has Windows wheels for <3.9 but still does not have any wheels for linux/mac, so if we were going to re-introduce it as a dependency it would have to be Windows only.

But calling taskkill here also works for our purposes

@pmrowla pmrowla marked this pull request as ready for review November 10, 2020 09:07
@pmrowla pmrowla changed the title [WIP] checkpoints: exp run and exp res[ume] refactor checkpoints: exp run and exp res[ume] refactor Nov 10, 2020
@pmrowla pmrowla requested review from skshetry, pared and efiop November 10, 2020 09:10
@pmrowla pmrowla added ui user interface / interaction skip-changelog Skips changelog labels Nov 11, 2020
@pmrowla pmrowla merged commit 389efd7 into iterative:master Nov 11, 2020
@pmrowla pmrowla deleted the exp-resume branch November 11, 2020 05:19
I159 added a commit to I159/dvc that referenced this pull request Nov 12, 2020
…limit

* 'master' of github.com:iterative/dvc:
  dag: add --outs option (iterative#4739)
  Add test server and tests for webdav (iterative#4827)
  Simpler param updates with python-benedict (iterative#4780)
  checkpoints: set DVC_ROOT environment variable (iterative#4877)
  api: add support for simple wildcards (iterative#4864)
  tests: mark azure test as flaky (iterative#4881)
  setup.py: limit responses version for moto (iterative#4879)
  remote: avoid chunking on webdav. Fixes iterative#4796 (iterative#4828)
  checkpoints: `exp run` and `exp res[ume]` refactor (iterative#4855)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: experiments Related to dvc exp skip-changelog Skips changelog ui user interface / interaction
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant