exp show: sync state between queue and exp show table #8158

Merged
merged 8 commits into from
Sep 26, 2022

Contributor

@karajan1001 karajan1001 commented Aug 22, 2022

fix: #8088

  1. Refactor: separate executor initialization from environment setup.
  2. Move ref setup into executor.init_git.
  3. Add a new status attribute to the ExecutorInfo file.
  4. Write the running status to the executor info file.
  5. Use the task status in place of collected.
  6. Move some basic tests from the functional tests to the unit tests.
  7. Add success/failure tests for the status changes in the tempdir, celery, and workspace run cases.
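
As a rough illustration of items 3 to 5, the sketch below shows how a status field could take over from a boolean such as collected. It is not the code from this PR: the `ExecutorInfo` fields are a trimmed-down, hypothetical stand-in, and of the `TaskStatus` members only PREPARING is named in this thread; the others are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from enum import IntEnum


class TaskStatus(IntEnum):
    # Only PREPARING is named in this thread; the other members are assumed.
    PENDING = 0
    PREPARING = 1
    RUNNING = 2
    SUCCESS = 3
    FAILED = 4


@dataclass
class ExecutorInfo:
    # Hypothetical, trimmed-down stand-in for the real info file contents.
    git_url: str
    baseline_rev: str
    status: TaskStatus = TaskStatus.PENDING

    def dumps(self) -> str:
        data = asdict(self)
        data["status"] = int(self.status)  # store the status as a plain integer
        return json.dumps(data)

    @classmethod
    def loads(cls, text: str) -> "ExecutorInfo":
        data = json.loads(text)
        data["status"] = TaskStatus(data["status"])
        return cls(**data)


def is_finished(info: ExecutorInfo) -> bool:
    # A status check in place of a plain boolean flag such as `collected`.
    return info.status in (TaskStatus.SUCCESS, TaskStatus.FAILED)
```

A single status value lets a reader of the info file tell preparing, running, and finished tasks apart, which a two-state flag cannot do.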

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

@karajan1001 karajan1001 added enhancement Enhances DVC A: experiments Related to dvc exp A: executors labels Aug 22, 2022
@karajan1001 karajan1001 requested a review from pmrowla August 22, 2022 08:46
@karajan1001 karajan1001 self-assigned this Aug 22, 2022
@karajan1001 karajan1001 requested a review from a team as a code owner August 22, 2022 08:46
@karajan1001 karajan1001 marked this pull request as draft August 22, 2022 08:46
@karajan1001 karajan1001 force-pushed the fix8088 branch 2 times, most recently from 68b0d92 to 4854874 Compare August 23, 2022 09:18
@karajan1001 karajan1001 changed the title [WIP] exp show: sync state between queue and exp show table exp show: sync state between queue and exp show table Aug 23, 2022
@karajan1001 karajan1001 marked this pull request as ready for review August 23, 2022 10:11
Contributor

@pmrowla pmrowla left a comment

Also, executor.manager should not be modified in this PR. We don't use it at all (it has been replaced by the queue classes). The code was left in since there is still some legacy dvc-machine/SSH executor related behavior that has not been moved into dvc-machine/SSH queues.

But at this point it's probably better for us to just remove it since it is currently unused, and then restore it once we get back to dvc-machine development.

(but removing it should be done in a separate PR)

@@ -248,7 +253,8 @@ def _load_info(rev: str) -> ExecutorInfo:

def _load_collected(rev: str) -> Optional[ExecutorResult]:
executor_info = _load_info(rev)
if executor_info.collected:
print("executor_info is", executor_info.status)
Contributor

extra debugging statement

Comment on lines 92 to 91
scm.set_ref(EXEC_HEAD, entry.head_rev)
scm.set_ref(EXEC_MERGE, stash_rev)
scm.set_ref(EXEC_BASELINE, entry.baseline_rev)
refspec = f"{EXEC_NAMESPACE}/"
push_refspec(scm, self.git_url, refspec, refspec)
finally:
for ref in (EXEC_HEAD, EXEC_MERGE, EXEC_BASELINE):
if scm.get_ref(ref):
scm.remove_ref(ref)

Contributor

We should not be duplicating this try/finally code across every executor type. It should either stay in queue.base (before we init an executor), or executor classes should be re-using some code from BaseExecutor.

The reason it was outside of executors before (in manager and then queue.base), is that ideally we should be setting and removing those refs at the same level. And whether or not we want to remove them depends on when and why we are generating any executor at all, it is not dependent on specific executor types.

Basically, whether or not to clear the refs after init'ing an executor depends on whether or not DVC needs to clean the entire workspace or not. In practice, this means whether we are doing a workspace run or any other type of run (tempdir, remote machine, etc). But the actual executors themselves will function whether or not the refs are cleared after calling init_git. So the caller should be responsible for setting and clearing those refs, not the executor itself.

Contributor Author

Basically, whether or not to clear the refs after init'ing an executor depends on whether or not DVC needs to clean the entire workspace or not.

Hi @pmrowla, thanks for the suggestions, but this is not about cleaning the executor's references; it is about cleaning the workspace's references during initialization. The only case where we do not need to clean these references is the workspace case, because we need these EXEC references later, while running.

The reason it was outside of executors before (in manager and then queue.base), is that ideally we should be setting and removing those refs at the same level.

This is the reason why I brought them here: the references were being set in the queue, while the cleanup happened inside the executor's init. We had already forgotten to clean them up in the earlier #8043. This is why I have now put them all here, at the same level.

Comment on lines +187 to +184
scm.set_ref(EXEC_HEAD, entry.head_rev)
scm.set_ref(EXEC_MERGE, stash_rev)
scm.set_ref(EXEC_BASELINE, entry.baseline_rev)
Contributor

@karajan1001 this is still a problem where we have a lot of duplicated code (in all of the different executor init methods). If you want to keep this in the executor classes we can do that, but the common code should be refactored so that it goes in the base executor class.

Maybe something like this in the base class

@contextmanager
def set_exec_refs(self):
    # init refs
    self.scm.set_ref(...)
    try:
        yield
    finally:
        # cleanup refs
        self.scm.remove_ref(...)

and then within each executor init_git you would have

def init_git(self):
    with self.set_exec_refs():
        # existing init_git code
        ...
    # any additional executor specific cleanup handling here
    ...
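
For readers who want to try the idea, here is a minimal runnable sketch of the suggested pattern. It is an illustration under assumptions, not the code that was merged: `FakeScm` is an in-memory stand-in that offers only the three ref calls seen in the diff above, the ref values are placeholders, and `exec_refs` is a hypothetical name.

```python
from contextlib import contextmanager

# Placeholder values; the real constants are not shown in this thread.
EXEC_HEAD = "refs/exps/exec/EXEC_HEAD"
EXEC_MERGE = "refs/exps/exec/EXEC_MERGE"
EXEC_BASELINE = "refs/exps/exec/EXEC_BASELINE"


class FakeScm:
    """In-memory stand-in with only the ref calls used in the diff."""

    def __init__(self):
        self.refs = {}

    def set_ref(self, name, rev):
        self.refs[name] = rev

    def get_ref(self, name):
        return self.refs.get(name)

    def remove_ref(self, name):
        del self.refs[name]


@contextmanager
def exec_refs(scm, head_rev, stash_rev, baseline_rev):
    """Set the three EXEC refs for the duration of the block, then remove them."""
    try:
        scm.set_ref(EXEC_HEAD, head_rev)
        scm.set_ref(EXEC_MERGE, stash_rev)
        scm.set_ref(EXEC_BASELINE, baseline_rev)
        yield
    finally:
        # Runs on success and on error, so no ref is left behind.
        for ref in (EXEC_HEAD, EXEC_MERGE, EXEC_BASELINE):
            if scm.get_ref(ref):
                scm.remove_ref(ref)
```

Because the cleanup sits in one place, an executor type that must keep the refs for longer can skip the context manager and remove them at its own later point.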

Contributor Author

There is one exception, the WorkspaceExecutor, because it removes these references only after the task has completed.

@karajan1001 karajan1001 force-pushed the fix8088 branch 2 times, most recently from afef91f to 47f4277 Compare September 9, 2022 08:00
@karajan1001 karajan1001 requested a review from pmrowla September 9, 2022 08:00
@karajan1001
Contributor Author

It looks like I need to clean up the manager, or pylint will fail.

@karajan1001
Contributor Author

karajan1001 commented Sep 14, 2022

Excuse me, @mattseddon, Could you please help me to verify if the problem is solved after this PR on the VSCode extension? Thank you.

@mattseddon
Contributor

Excuse me, @mattseddon, Could you please help me to verify if the problem is solved after this PR on the VSCode extension? Thank you.

I will test 👍🏻.

@mattseddon
Contributor

This is the experience in the extension using dvc queue start -j 1 with 3 queued experiments:

queue-start-j-1.mov

It is an issue that dvc exp show can run into unexpected errors that look like this:

ERROR: unexpected error - Invalid revision: b'02712dc464ab868043e7eefc335a8d5fd39ab6f7'

Question: Is running in exp show synced with TaskStatus.PREPARING in the executor or is there another field in the output that I should be looking for?


Note: Probably unrelated to this change but I started by trying to run with -j 3 and saw some very weird behaviour.

I went through the following steps:

  1. installed git+https://github.com/karajan1001/dvc.git@fix8088 into a demo project's virtual environment
  2. dvc exp run --queue x 3 with different params for each
  3. dvc queue start -j 3
  4. One experiment succeeded and two failed.

After those failures, any attempt to queue an experiment would result in the experiment being run straight away. Even though queue status stated that there were no active workers:

~/demo main !2 ?1 ❯ dvc queue status
Task     Name       Created    Status
e7bf66b             10:35 AM   Failed
ffc912f             10:35 AM   Failed
08a047c  exp-92e70  10:35 AM   Success
6e259ab  exp-684d6  10:41 AM   Success
aad7cf0  exp-14b96  10:40 AM   Success

Worker status: 0 active, 0 idle

~/demo main !2 ?1 ❯ dvc exp run --queue
Queued experiment '2f7751d' for future execution.   
                                                                                                                                                                                                                                                             
~/demo main !2 ?1 ❯ dvc queue status
Task     Name       Created    Status
2f7751d             10:42 AM   Running
e7bf66b             10:35 AM   Failed
ffc912f             10:35 AM   Failed
08a047c  exp-92e70  10:35 AM   Success
6e259ab  exp-684d6  10:41 AM   Success
aad7cf0  exp-14b96  10:40 AM   Success

Worker status: 0 active, 0 idle

The only way that I could get the repo out of this state was to delete .dvc/tmp/exps. dvc queue stop & dvc queue kill had no impact.

@karajan1001
Contributor Author

karajan1001 commented Sep 15, 2022

Note: Probably unrelated to this change but I started by trying to run with -j 3 and saw some very weird behaviour.

I went through the following steps:

  1. installed git+https://github.com/karajan1001/dvc.git@fix8088 into a demo project's virtual environment
  2. dvc exp run --queue x 3 with different params for each
  3. dvc queue start -j 3
  4. One experiment succeeded and two failed.

After those failures, any attempt to queue an experiment would result in the experiment being run straight away. Even though queue status stated that there were no active workers:

For a job count of 3, you need to test again after iterative/dvc-task#90 is merged.

Question: Is running in exp show synced with TaskStatus.PREPARING in the executor or is there another field in the output that I should be looking for?

exp show reads the TaskStatus of each experiment but does not depend on it alone, because the TaskStatus is only generated once the experiment begins to run.
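
One way to read this reply is sketched below. The helper is hypothetical and is not the logic used by exp show: it assumes a row can be labelled from two inputs, whether the task is still in the queue and whether an info file with a status exists yet. Only PREPARING is a status name taken from this thread; the other names and the label strings are assumptions.

```python
from enum import Enum
from typing import Optional


class TaskStatus(Enum):
    # PREPARING appears in this thread; the remaining names are assumed.
    PREPARING = "preparing"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"


def display_status(in_queue: bool, info_status: Optional[TaskStatus]) -> str:
    """Hypothetical helper: choose the label a table row could show."""
    if info_status is not None:
        # An info file exists, so the executor has started and owns the state.
        return info_status.value
    # No info file yet: fall back to what the queue knows.
    return "queued" if in_queue else "unknown"
```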

@mattseddon
Contributor

Verbose log for error:

~/projects/vscode-dvc/demo main *1 !4 ?1 ❯ dvc exp show --show-json -v                                                                                                                                                                                                          ✘ 252 18s  .env  base 10:23:44
2022-09-16 10:23:52,332 ERROR: unexpected error - Invalid revision: b'd0b057085e1e96b3406f78cd9cb2decfb86976b3'
------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/scmrepo/git/backend/dulwich/__init__.py", line 652, in fetch_refspecs
    check_diverged(
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/dulwich/porcelain.py", line 347, in check_diverged
    raise DivergedBranches(current_sha, new_sha)
dulwich.porcelain.DivergedBranches: b'd0b057085e1e96b3406f78cd9cb2decfb86976b3'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/scmrepo/git/backend/dulwich/__init__.py", line 745, in diff
    commit_a = self.repo[os.fsencode(rev_a)]
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/dulwich/repo.py", line 787, in __getitem__
    return self.object_store[self.refs[name]]
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/dulwich/refs.py", line 320, in __getitem__
    raise KeyError(name)
KeyError: b'd0b057085e1e96b3406f78cd9cb2decfb86976b3'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/dvc/cli/__init__.py", line 185, in main
    ret = cmd.do_run()
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/dvc/cli/command.py", line 22, in do_run
    return self.run()
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/dvc/commands/experiments/show.py", line 475, in run
    all_experiments = self.repo.experiments.show(
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/dvc/repo/experiments/__init__.py", line 516, in show
    return show(self.repo, *args, **kwargs)
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/dvc/repo/experiments/show.py", line 163, in show
    running = repo.experiments.get_running_exps(fetch_refs=fetch_running)
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/dvc/repo/experiments/__init__.py", line 443, in get_running_exps
    self._fetch_running_exp(
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/dvc/repo/experiments/__init__.py", line 482, in _fetch_running_exp
    for ref in executor.fetch_exps(
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/dvc/repo/experiments/executor/base.py", line 358, in fetch_exps
    dest_scm.fetch_refspecs(
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/scmrepo/git/__init__.py", line 289, in _backend_func
    result = func(*args, **kwargs)
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/scmrepo/git/backend/dulwich/__init__.py", line 658, in fetch_refspecs
    on_diverged(
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/dvc/repo/experiments/executor/base.py", line 349, in on_diverged_ref
    self._raise_ref_conflict(
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/dvc/repo/experiments/executor/base.py", line 734, in _raise_ref_conflict
    if scm.diff(orig_rev, new_rev):
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/scmrepo/git/__init__.py", line 289, in _backend_func
    result = func(*args, **kwargs)
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/scmrepo/git/backend/dulwich/__init__.py", line 748, in diff
    raise RevError("Invalid revision") from exc
scmrepo.exceptions.RevError: Invalid revision
------------------------------------------------------------
2022-09-16 10:23:52,444 DEBUG: Removing '/Users/mattseddon/projects/vscode-dvc/.UUrYrHCXCoi7th3ronGrot.tmp'
2022-09-16 10:23:52,444 DEBUG: Removing '/Users/mattseddon/projects/vscode-dvc/.UUrYrHCXCoi7th3ronGrot.tmp'
2022-09-16 10:23:52,445 DEBUG: Removing '/Users/mattseddon/projects/vscode-dvc/.UUrYrHCXCoi7th3ronGrot.tmp'
2022-09-16 10:23:52,445 DEBUG: Removing '/Users/mattseddon/projects/vscode-dvc/demo/.dvc/cache/.Smc3Bw3TUaCcMUU6quBhPH.tmp'
2022-09-16 10:23:52,445 DEBUG: Version info for developers:
DVC version: 1.0.2.dev2348+gb4beb4e8 
---------------------------------
Platform: Python 3.9.13 on macOS-12.6-arm64-arm-64bit
Subprojects:
        dvc_data = 0.7.1
        dvc_objects = 0.2.2
        dvc_render = 0.0.10
        dvc_task = 0.1.2
        dvclive = 0.10.0
        scmrepo = 0.1.1
Supports:
        http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2022.8.2, boto3 = 1.24.59)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: s3
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc (subdir), git

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2022-09-16 10:23:52,446 DEBUG: Analytics is enabled.
2022-09-16 10:23:52,482 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/var/folders/sb/fqcw44jd19nfrhl_9lz_81d80000gn/T/tmpyxrdv_qe']'
2022-09-16 10:23:52,484 DEBUG: Spawned '['daemon', '-q', 'analytics', '/var/folders/sb/fqcw44jd19nfrhl_9lz_81d80000gn/T/tmpyxrdv_qe']'

Reason:
1. Need more granular control over the executor's initialization
   process.

What was done:
1. Separate the `init_git` and `init_cache` steps out of setup_executor
2. Move `set_ref` into `init_git`
1. Add a new attribute status to ExecutorInfo file
2. Update running status to the executor infofile.
fix: iterative#8088

1. Use status to replace collected.
2. Make status show more accurate.

Fix lint
1. Move some basic tests from the functional tests to the unit tests.
2. Add success/failed tests for the status changes in the `tempdir`, `celery`, and `workspace` run cases.
@karajan1001
Contributor Author

karajan1001 commented Sep 19, 2022

@mattseddon

The previous problem occurred because collecting the result failed during the git pull operations (because of duplicated experiment names), which made the final result collection fail. I have now moved all of the final dump operations into the cleanup function, which guarantees that they run in a finally scope.


Could you please try it again? I suspect your local repo was polluted by the previous test and may need to be cleaned manually. Building a completely new workspace might help; since the previous error was triggered in a dirty environment, a freshly built one might not trigger it.
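
The guarantee described above can be sketched as follows. This is an assumption-laden illustration rather than the actual change: `run_task`, its JSON info file layout, and the status strings are all hypothetical, and `execute` and `collect_result` are caller-supplied callables.

```python
import json
import tempfile
from pathlib import Path


def _dump(infofile: Path, status: str) -> None:
    infofile.write_text(json.dumps({"status": status}))


def run_task(infofile: Path, execute, collect_result) -> None:
    """Hypothetical runner that always writes a final status."""
    status = "failed"
    _dump(infofile, "running")
    try:
        execute()
        status = "success"
        collect_result()  # may raise, for example on a git ref conflict
    finally:
        # The final dump happens even if execution or collection raised.
        _dump(infofile, status)
```

In this sketch a collection failure still propagates to the caller, but the info file is never left saying "running" after the task has ended. Recording "success" when only the collection step failed is a design choice made here for illustration.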

Previously the infofile dump was spread across both the executor and queue
classes. Put them all into the executor to make them more manageable. Move
all ending dump operations to the cleanup function, which guarantees that
they run in a `finally` scope, because the collect_result process might
fail on git operations.
@mattseddon
Contributor

Could you please try it again

I will test today.

@mattseddon
Contributor

@karajan1001 I'm still seeing the same behaviour. Even with a fresh clone of https://github.com/iterative/vscode-dvc:

dvc exp show --show-json -v
2022-09-20 15:36:25,649 ERROR: unexpected error - Invalid revision: b'1dbb61e08d6d96c6db910a6d392cb1a1bdb9d04d'
------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/scmrepo/git/backend/dulwich/__init__.py", line 652, in fetch_refspecs
    check_diverged(
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/dulwich/porcelain.py", line 347, in check_diverged
    raise DivergedBranches(current_sha, new_sha)
dulwich.porcelain.DivergedBranches: b'2e1b8fbd00600a7457fb91fa14fa7d248a73913b'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/scmrepo/git/backend/dulwich/__init__.py", line 746, in diff
    commit_b = self.repo[os.fsencode(rev_b)]
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/dulwich/repo.py", line 787, in __getitem__
    return self.object_store[self.refs[name]]
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/dulwich/refs.py", line 320, in __getitem__
    raise KeyError(name)
KeyError: b'1dbb61e08d6d96c6db910a6d392cb1a1bdb9d04d'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/dvc/cli/__init__.py", line 185, in main
    ret = cmd.do_run()
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/dvc/cli/command.py", line 22, in do_run
    return self.run()
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/dvc/commands/experiments/show.py", line 475, in run
    all_experiments = self.repo.experiments.show(
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/dvc/repo/experiments/__init__.py", line 516, in show
    return show(self.repo, *args, **kwargs)
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/dvc/repo/experiments/show.py", line 163, in show
    running = repo.experiments.get_running_exps(fetch_refs=fetch_running)
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/dvc/repo/experiments/__init__.py", line 443, in get_running_exps
    self._fetch_running_exp(
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/dvc/repo/experiments/__init__.py", line 482, in _fetch_running_exp
    for ref in executor.fetch_exps(
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/dvc/repo/experiments/executor/base.py", line 363, in fetch_exps
    dest_scm.fetch_refspecs(
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/scmrepo/git/__init__.py", line 289, in _backend_func
    result = func(*args, **kwargs)
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/scmrepo/git/backend/dulwich/__init__.py", line 658, in fetch_refspecs
    on_diverged(
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/dvc/repo/experiments/executor/base.py", line 354, in on_diverged_ref
    self._raise_ref_conflict(
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/dvc/repo/experiments/executor/base.py", line 739, in _raise_ref_conflict
    if scm.diff(orig_rev, new_rev):
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/scmrepo/git/__init__.py", line 289, in _backend_func
    result = func(*args, **kwargs)
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/scmrepo/git/backend/dulwich/__init__.py", line 748, in diff
    raise RevError("Invalid revision") from exc
scmrepo.exceptions.RevError: Invalid revision
------------------------------------------------------------
2022-09-20 15:36:25,725 DEBUG: Removing '/Users/mattseddon/projects/vc-nc1/.RzXRu93vW73iJFXLaqEGG2.tmp'
2022-09-20 15:36:25,725 DEBUG: Removing '/Users/mattseddon/projects/vc-nc1/.RzXRu93vW73iJFXLaqEGG2.tmp'
2022-09-20 15:36:25,725 DEBUG: Removing '/Users/mattseddon/projects/vc-nc1/.RzXRu93vW73iJFXLaqEGG2.tmp'
2022-09-20 15:36:25,725 DEBUG: Removing '/Users/mattseddon/projects/vc-nc1/demo/.dvc/cache/.SR4jZgarDBcjiLMyqZnbZg.tmp'
2022-09-20 15:36:25,726 DEBUG: Version info for developers:
DVC version: 1.0.2.dev2371+g94c458d3 
---------------------------------
Platform: Python 3.9.13 on macOS-12.6-arm64-arm-64bit
Subprojects:
        dvc_data = 0.10.0
        dvc_objects = 0.4.0
        dvc_render = 0.0.11
        dvc_task = 0.1.2
        dvclive = 0.10.0
        scmrepo = 0.1.1
Supports:
        http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: https
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc (subdir), git

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2022-09-20 15:36:25,727 DEBUG: Analytics is enabled.
2022-09-20 15:36:25,758 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/var/folders/sb/fqcw44jd19nfrhl_9lz_81d80000gn/T/tmp04tcned9']'
2022-09-20 15:36:25,760 DEBUG: Spawned '['daemon', '-q', 'analytics', '/var/folders/sb/fqcw44jd19nfrhl_9lz_81d80000gn/T/tmp04tcned9']'

@karajan1001
Contributor Author

@karajan1001 I'm still seeing the same behaviour. Even with a fresh clone of https://github.com/iterative/vscode-dvc:

Sorry I can not reproduce this error.

asciicast

I also have another question: will this PR, which changes the format of the JSON output of exp show, affect the current version of the VS Code extension?

BTW, I found that dvc exp show currently runs really slowly. With the help of cProfile, we can see that the three major time costs in dvc exp show are describing the SCM revisions, collecting the stages, and rendering the table.
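
For anyone who wants to repeat this kind of measurement, the snippet below is a small sketch of profiling a single call with the standard library's cProfile and pstats modules. `profile_call` is a hypothetical helper written for this note; which DVC function to pass into it is left to the reader.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show the `top` entries with the largest cumulative time first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

Sorting by cumulative time surfaces the high-level calls that dominate the run, which is the kind of breakdown described above.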

image

@mattseddon
Contributor

Sorry I can not reproduce this error.

Can you try with watch -n0 dvc exp show --show-json? I can reproduce with just two terminals:

Screen.Recording.2022-09-21.at.8.54.12.pm.mov
Screen.Recording.2022-09-21.at.9.01.22.pm.mov

@karajan1001
Contributor Author

karajan1001 commented Sep 22, 2022

Sorry I can not reproduce this error.

Can you try with watch -n0 dvc exp show --show-json? I can reproduce with just two terminals:

Screen.Recording.2022-09-21.at.8.54.12.pm.mov
Screen.Recording.2022-09-21.at.9.01.22.pm.mov

@mattseddon
Now I understand. Previously I believed the error was a persistent one, but with the watch command I can see that it occurs in an intermediate state. Testing on my local machine, I found that the bug already exists before this PR, and that this PR solves the status out-of-sync problem in exp show.

asciicast

We should open separate issues for the other problems that show up during exp show. What I have found so far:

  1. Invalid ref during data collection.
  2. Task status turns to queued for 1 second before turning to success.
  3. Invalid ref during exp remove.

@dberenbaum
Contributor

@mattseddon Are these issues blockers for you? If you are getting intermittent errors, is it possible to ignore those?

@mattseddon
Contributor

@mattseddon Are these issues blockers for you? If you are getting intermittent errors, is it possible to ignore those?

This is a blocker. We cannot reliably ignore these errors as we cannot distinguish them from any other error type.

@dberenbaum
Contributor

@mattseddon Are these issues blockers for you? If you are getting intermittent errors, is it possible to ignore those?

This is a blocker. We cannot reliably ignore these errors as we cannot distinguish them from any other error type.

Thanks @mattseddon.

A few follow up questions:

  1. Should it block the current PR? I see the VS Code table disappearing for a bit both before and after this PR. After this PR, I at least see much quicker updates to the table when the queue is started. Do you see the same? Are any of these errors new to this PR? I'm wondering if we can merge and work on the issues mentioned by @karajan1001 as follow ups.
  2. Will the table disappear anytime there is an error returned by exp show? It seems like a strong assumption for a command that is constantly running in the background. For example, why not raise an error dialog but keep the last version of the table visible until the error is resolved? Or wait some number of iterations/amount of time before showing the error?
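
The alternative raised in question 2 could look roughly like the sketch below. It is written in Python only to stay consistent with the rest of this thread and is not the extension's code; `StaleTolerantTable`, `fetch`, and `max_failures` are hypothetical names. The last good rows stay visible, and an error is surfaced only after several consecutive failures.

```python
from typing import Callable, Optional


class StaleTolerantTable:
    """Hypothetical poller state that keeps the last good rows on errors."""

    def __init__(self, fetch: Callable[[], list], max_failures: int = 3):
        self._fetch = fetch
        self._max_failures = max_failures
        self._failures = 0
        self.rows: Optional[list] = None
        self.error: Optional[str] = None

    def refresh(self) -> None:
        try:
            self.rows = self._fetch()
        except Exception as exc:
            # Keep the stale rows; report only if the failure persists.
            self._failures += 1
            if self._failures >= self._max_failures:
                self.error = str(exc)
        else:
            self._failures = 0
            self.error = None
```

The trade-off is the one debated here: a stale table hides short-lived errors from the user, so the threshold decides how long a real failure can go unreported.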

@mattseddon
Contributor

  1. Should it block the current PR? I see the VS Code table disappearing for a bit both before and after this PR. After this PR, I at least see much quicker updates to the table when the queue is started. Do you see the same? Are any of these errors new to this PR? I'm wondering if we can merge and work on the issues mentioned by @karajan1001 as follow ups.

Doesn't need to block this PR.

  2. Will the table disappear anytime there is an error returned by exp show? It seems like a strong assumption for a command that is constantly running in the background. For example, why not raise an error dialog but keep the last version of the table visible until the error is resolved? Or wait some number of iterations/amount of time before showing the error?

Yes, it will disappear for any error. I would like to move away from the papering over the cracks approach that we have taken up until now.

@karajan1001 karajan1001 merged commit 6bd14b2 into iterative:main Sep 26, 2022
@karajan1001 karajan1001 deleted the fix8088 branch September 26, 2022 08:25
@dberenbaum
Contributor

Yes, it will disappear for any error. I would like to move away from the papering over the cracks approach that we have taken up until now.

Agreed, but I'd consider these separate issues. Regardless of how stable the commands become, it still seems severe to me to have the table disappear in case an unknown error ever occurs. I would almost always prefer it to be stale than have it disappear. Is there a reason that dropping the table is considered preferable?

@karajan1001
Contributor Author

karajan1001 commented Sep 27, 2022

I gathered some of the other problems I ran into while using exp show:

  1. exp show is slow in a repo with a large number of checkpoints (this looks related to the collection of every single checkpoint).
  2. Initialization of a temp workspace is slow (in Matt's demo it usually takes about half a minute on my computer).
  3. During that initialization we can't kill the queue tasks, because no info file exists yet.
