exp show: sync state between queue and exp show table #8158

karajan1001 · 2022-08-22T08:46:34Z

Refactor: separate the initialization of the executor from the environment setup
Move ref setup into executor.init_git
Add a new attribute status to ExecutorInfo file
Update running status to the executor infofile.
Use task status to replace collected.
Move some basic test scripts from functional tests to unit tests.
Add success/failure tests for the status changes in the tempdir, celery, and workspace run cases.
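
A minimal sketch of the status idea in the list above. The enum members, the `ExecutorInfoSketch` class, and its fields are assumptions for illustration, not DVC's actual definitions; only `TaskStatus.PREPARING` and the replacement of the `collected` flag by a status come from this thread.

```python
import json
from dataclasses import asdict, dataclass
from enum import IntEnum


class TaskStatus(IntEnum):
    # Hypothetical members; only PREPARING is named in this thread.
    PENDING = 0
    PREPARING = 1
    RUNNING = 2
    SUCCESS = 3
    FAILED = 4


@dataclass
class ExecutorInfoSketch:
    # Stand-in for the executor infofile contents (assumed fields).
    git_url: str
    baseline_rev: str
    status: TaskStatus = TaskStatus.PENDING

    def dumps(self) -> str:
        # IntEnum values serialize as plain integers in JSON.
        return json.dumps(asdict(self))

    @classmethod
    def loads(cls, text: str) -> "ExecutorInfoSketch":
        data = json.loads(text)
        data["status"] = TaskStatus(data["status"])
        return cls(**data)

    @property
    def finished(self) -> bool:
        # A status check takes the place of a boolean `collected` flag.
        return self.status in (TaskStatus.SUCCESS, TaskStatus.FAILED)
```

A reader such as exp show could then load the infofile and branch on `status` rather than on a single boolean.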

❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

pmrowla

Also, executor.manager should not be modified in this PR. We don't use it at all (it has been replaced by the queue classes). The code was left in since there is still some legacy dvc-machine/SSH executor related behavior that has not been moved into dvc-machine/SSH queues.

But at this point it's probably better for us to just remove it since it is currently unused, and then restore it once we get back to dvc-machine development.

(but removing it should be done in a separate PR)

pmrowla · 2022-08-24T05:37:41Z

dvc/repo/experiments/queue/celery.py

@@ -248,7 +253,8 @@ def _load_info(rev: str) -> ExecutorInfo:

        def _load_collected(rev: str) -> Optional[ExecutorResult]:
            executor_info = _load_info(rev)
-            if executor_info.collected:
+            print("executor_info is", executor_info.status)


extra debugging statement

pmrowla · 2022-08-24T05:48:25Z

dvc/repo/experiments/executor/local.py

+            scm.set_ref(EXEC_HEAD, entry.head_rev)
+            scm.set_ref(EXEC_MERGE, stash_rev)
+            scm.set_ref(EXEC_BASELINE, entry.baseline_rev)
+            refspec = f"{EXEC_NAMESPACE}/"
+            push_refspec(scm, self.git_url, refspec, refspec)
+        finally:
+            for ref in (EXEC_HEAD, EXEC_MERGE, EXEC_BASELINE):
+                if scm.get_ref(ref):
+                    scm.remove_ref(ref)
+


We should not be duplicating this try/finally code across every executor type. It should either stay in queue.base (before we init an executor), or executor classes should be re-using some code from BaseExecutor.

The reason it was outside of executors before (in manager and then queue.base), is that ideally we should be setting and removing those refs at the same level. And whether or not we want to remove them depends on when and why we are generating any executor at all, it is not dependent on specific executor types.

Basically, whether or not to clear the refs after init'ing an executor depends on whether or not DVC needs to clean the entire workspace. In practice, this means whether we are doing a workspace run or any other type of run (tempdir, remote machine, etc). But the actual executors themselves will function whether or not the refs are cleared after calling init_git. So the caller should be responsible for setting and clearing those refs, not the executor itself.

Basically, whether or not to clear the refs after init'ing an executor depends on whether or not DVC needs to clean the entire workspace or not.

Hi @pmrowla, thanks for the suggestions, but this is not about cleaning the executor's references; it is about cleaning the workspace's references during initialization. The only case where we do not need to clean these references is the workspace run, because we need these EXEC references later while running.

The reason it was outside of executors before (in manager and then queue.base), is that ideally we should be setting and removing those refs at the same level.

This is the reason why I moved them here: the references are now set in the queue, but the cleanup happens inside the executor's init. And we had already forgotten to clean them up in the earlier #8043. This is why I have now put them all here, at the same level.

pmrowla · 2022-09-08T05:43:39Z

dvc/repo/experiments/executor/local.py

+        scm.set_ref(EXEC_HEAD, entry.head_rev)
+        scm.set_ref(EXEC_MERGE, stash_rev)
+        scm.set_ref(EXEC_BASELINE, entry.baseline_rev)


@karajan1001 this is still a problem where we have a lot of duplicated code (in all of the different executor init methods). If you want to keep this in the executor classes we can do that, but the common code should be refactored so that it goes in the base executor class.

Maybe something like this in the base class

    @contextmanager
    def set_exec_refs(self):
        # init refs
        scm.set_ref(...)
        try:
            yield
        finally:
            # cleanup refs
            self.scm.remove_ref(...)

and then within each executor init_git you would have

    def init_git(self):
        with self.set_exec_refs():
            # existing init_git code
            ...
        # any additional executor specific cleanup handling here
        ...

The only exception is the WorkspaceExecutor, which removes these references after the task has completed.
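
For reference, a runnable sketch of the base-class context manager discussed in this review thread. `FakeScm`, the `*Sketch` classes, and the ref name strings are hypothetical stand-ins so the example runs on its own; they are not DVC's real classes or constants.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, Optional

# Placeholder ref names (assumed values, for illustration only).
EXEC_HEAD = "refs/exps/exec/EXEC_HEAD"
EXEC_MERGE = "refs/exps/exec/EXEC_MERGE"
EXEC_BASELINE = "refs/exps/exec/EXEC_BASELINE"


class FakeScm:
    """In-memory stand-in for the Git wrapper (hypothetical)."""

    def __init__(self) -> None:
        self.refs: Dict[str, str] = {}

    def set_ref(self, name: str, rev: str) -> None:
        self.refs[name] = rev

    def get_ref(self, name: str) -> Optional[str]:
        return self.refs.get(name)

    def remove_ref(self, name: str) -> None:
        self.refs.pop(name, None)


class BaseExecutorSketch:
    def __init__(self, scm: FakeScm) -> None:
        self.scm = scm

    @contextmanager
    def set_exec_refs(self, head: str, merge: str, baseline: str) -> Iterator[None]:
        try:
            # Set the temporary refs for the duration of the block.
            self.scm.set_ref(EXEC_HEAD, head)
            self.scm.set_ref(EXEC_MERGE, merge)
            self.scm.set_ref(EXEC_BASELINE, baseline)
            yield
        finally:
            # Always remove whichever refs were set, even on error.
            for ref in (EXEC_HEAD, EXEC_MERGE, EXEC_BASELINE):
                if self.scm.get_ref(ref):
                    self.scm.remove_ref(ref)


class TempDirExecutorSketch(BaseExecutorSketch):
    def init_git(self, head: str, merge: str, baseline: str) -> Dict[str, str]:
        with self.set_exec_refs(head, merge, baseline):
            # Executor-specific work would go here; we only snapshot the refs.
            return dict(self.scm.refs)
```

An executor that needs the refs to outlive init_git, as described for the workspace case, would skip the context manager and remove the refs itself once the task is done.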

karajan1001 · 2022-09-09T08:08:48Z

Looks like we need to clean up the manager, or pylint will fail.

karajan1001 · 2022-09-14T05:18:26Z

Excuse me, @mattseddon, Could you please help me to verify if the problem is solved after this PR on the VSCode extension? Thank you.

mattseddon · 2022-09-14T05:34:42Z

Excuse me, @mattseddon, Could you please help me to verify if the problem is solved after this PR on the VSCode extension? Thank you.

I will test 👍🏻.

mattseddon · 2022-09-15T01:49:53Z

This is the experience in the extension using dvc queue start -j 1 with 3 queued experiments:

queue-start-j-1.mov

It is an issue that dvc exp show can run into unexpected errors that look like this:

ERROR: unexpected error - Invalid revision: b'02712dc464ab868043e7eefc335a8d5fd39ab6f7'

Question: Is running in exp show synced with TaskStatus.PREPARING in the executor or is there another field in the output that I should be looking for?

Note: Probably unrelated to this change but I started by trying to run with -j 3 and saw some very weird behaviour.

I went through the following steps:

installed git+https://github.com/karajan1001/dvc.git@fix8088 into a demo project's virtual environment
dvc exp run --queue x 3 with different params for each
dvc queue start -j 3
One experiment succeeded and two failed.

After those failures, any attempt to queue an experiment would result in the experiment being run straight away. Even though queue status stated that there were no active workers:

~/demo main !2 ?1 ❯ dvc queue status
Task     Name       Created    Status
e7bf66b             10:35 AM   Failed
ffc912f             10:35 AM   Failed
08a047c  exp-92e70  10:35 AM   Success
6e259ab  exp-684d6  10:41 AM   Success
aad7cf0  exp-14b96  10:40 AM   Success

Worker status: 0 active, 0 idle

~/demo main !2 ?1 ❯ dvc exp run --queue
Queued experiment '2f7751d' for future execution.

~/demo main !2 ?1 ❯ dvc queue status
Task     Name       Created    Status
2f7751d             10:42 AM   Running
e7bf66b             10:35 AM   Failed
ffc912f             10:35 AM   Failed
08a047c  exp-92e70  10:35 AM   Success
6e259ab  exp-684d6  10:41 AM   Success
aad7cf0  exp-14b96  10:40 AM   Success

Worker status: 0 active, 0 idle

The only way that I could get the repo out of this state was to delete .dvc/tmp/exps. dvc queue stop & dvc queue kill had no impact.

karajan1001 · 2022-09-15T07:20:51Z

Note: Probably unrelated to this change but I started by trying to run with -j 3 and saw some very weird behaviour.

I went through the following steps:

installed git+https://github.com/karajan1001/dvc.git@fix8088 into a demo project's virtual environment
dvc exp run --queue x 3 with different params for each
dvc queue start -j 3
One experiment succeeded and two failed.
After those failures, any attempt to queue an experiment would result in the experiment being run straight away. Even though queue status stated that there were no active workers:

For a job count of 3, you need to test it after iterative/dvc-task#90 is merged.

Question: Is running in exp show synced with TaskStatus.PREPARING in the executor or is there another field in the output that I should be looking for?

exp show reads the TaskStatus of each experiment but does not depend on it alone, because the TaskStatus is only generated once the experiment begins to run.
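
A rough sketch of that fallback, assuming a JSON infofile with a `status` key. The `resolve_status` helper, the file layout, and the status strings are hypothetical and only illustrate the idea that a status exists on disk only after the experiment has started.

```python
import json
from pathlib import Path


def resolve_status(infofile: Path, queued: bool) -> str:
    """Prefer the status stored in the infofile; otherwise fall back to queue state."""
    if infofile.exists():
        # The experiment has started, so a status has been written.
        return json.loads(infofile.read_text())["status"]
    # No infofile yet: the only information available is the queue itself.
    return "Queued" if queued else "Unknown"
```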

mattseddon · 2022-09-16T00:25:08Z

Verbose log for error:

~/projects/vscode-dvc/demo main *1 !4 ?1 ❯ dvc exp show --show-json -v
2022-09-16 10:23:52,332 ERROR: unexpected error - Invalid revision: b'd0b057085e1e96b3406f78cd9cb2decfb86976b3'
------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/scmrepo/git/backend/dulwich/__init__.py", line 652, in fetch_refspecs
    check_diverged(
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/dulwich/porcelain.py", line 347, in check_diverged
    raise DivergedBranches(current_sha, new_sha)
dulwich.porcelain.DivergedBranches: b'd0b057085e1e96b3406f78cd9cb2decfb86976b3'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/scmrepo/git/backend/dulwich/__init__.py", line 745, in diff
    commit_a = self.repo[os.fsencode(rev_a)]
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/dulwich/repo.py", line 787, in __getitem__
    return self.object_store[self.refs[name]]
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/dulwich/refs.py", line 320, in __getitem__
    raise KeyError(name)
KeyError: b'd0b057085e1e96b3406f78cd9cb2decfb86976b3'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/dvc/cli/__init__.py", line 185, in main
    ret = cmd.do_run()
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/dvc/cli/command.py", line 22, in do_run
    return self.run()
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/dvc/commands/experiments/show.py", line 475, in run
    all_experiments = self.repo.experiments.show(
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/dvc/repo/experiments/__init__.py", line 516, in show
    return show(self.repo, *args, **kwargs)
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/dvc/repo/experiments/show.py", line 163, in show
    running = repo.experiments.get_running_exps(fetch_refs=fetch_running)
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/dvc/repo/experiments/__init__.py", line 443, in get_running_exps
    self._fetch_running_exp(
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/dvc/repo/experiments/__init__.py", line 482, in _fetch_running_exp
    for ref in executor.fetch_exps(
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/dvc/repo/experiments/executor/base.py", line 358, in fetch_exps
    dest_scm.fetch_refspecs(
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/scmrepo/git/__init__.py", line 289, in _backend_func
    result = func(*args, **kwargs)
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/scmrepo/git/backend/dulwich/__init__.py", line 658, in fetch_refspecs
    on_diverged(
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/dvc/repo/experiments/executor/base.py", line 349, in on_diverged_ref
    self._raise_ref_conflict(
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/dvc/repo/experiments/executor/base.py", line 734, in _raise_ref_conflict
    if scm.diff(orig_rev, new_rev):
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/scmrepo/git/__init__.py", line 289, in _backend_func
    result = func(*args, **kwargs)
  File "/Users/mattseddon/projects/vscode-dvc/demo/.env/lib/python3.9/site-packages/scmrepo/git/backend/dulwich/__init__.py", line 748, in diff
    raise RevError("Invalid revision") from exc
scmrepo.exceptions.RevError: Invalid revision
------------------------------------------------------------
2022-09-16 10:23:52,444 DEBUG: Removing '/Users/mattseddon/projects/vscode-dvc/.UUrYrHCXCoi7th3ronGrot.tmp'
2022-09-16 10:23:52,444 DEBUG: Removing '/Users/mattseddon/projects/vscode-dvc/.UUrYrHCXCoi7th3ronGrot.tmp'
2022-09-16 10:23:52,445 DEBUG: Removing '/Users/mattseddon/projects/vscode-dvc/.UUrYrHCXCoi7th3ronGrot.tmp'
2022-09-16 10:23:52,445 DEBUG: Removing '/Users/mattseddon/projects/vscode-dvc/demo/.dvc/cache/.Smc3Bw3TUaCcMUU6quBhPH.tmp'
2022-09-16 10:23:52,445 DEBUG: Version info for developers:
DVC version: 1.0.2.dev2348+gb4beb4e8 
---------------------------------
Platform: Python 3.9.13 on macOS-12.6-arm64-arm-64bit
Subprojects:
        dvc_data = 0.7.1
        dvc_objects = 0.2.2
        dvc_render = 0.0.10
        dvc_task = 0.1.2
        dvclive = 0.10.0
        scmrepo = 0.1.1
Supports:
        http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2022.8.2, boto3 = 1.24.59)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: s3
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc (subdir), git

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2022-09-16 10:23:52,446 DEBUG: Analytics is enabled.
2022-09-16 10:23:52,482 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/var/folders/sb/fqcw44jd19nfrhl_9lz_81d80000gn/T/tmpyxrdv_qe']'
2022-09-16 10:23:52,484 DEBUG: Spawned '['daemon', '-q', 'analytics', '/var/folders/sb/fqcw44jd19nfrhl_9lz_81d80000gn/T/tmpyxrdv_qe']'

Reason:
1. Need more granular control over the executor's initialization process.

What was done:
1. Separate the `init_git` and `init_cache` steps out from setup_executor.
2. Move `set_ref` into `init_git`.

1. Add a new attribute status to ExecutorInfo file 2. Update running status to the executor infofile.

fix: iterative#8088 1. Use status to replace collected. 2. Make status show more accurate. Fix lint

1. Move some basic test script from function tests to unit test. 2. Add success/failed tests for the status change of `tempdir`, `celery`, `workspace` running case.

karajan1001 · 2022-09-19T01:37:09Z

@mattseddon

The previous problem occurred because the result-collection step failed during the Git pull operations (because of duplicated experiment names), which made the final result collection fail. I have now moved all the final dump operations into the cleanup function, which guarantees that they run in a finally scope.

Could you please try it again? I guess your local repo might have been polluted by the previous test, so you might need to clean the results manually. Building a completely new workspace might help, but since the previous error was caused by a dirty environment, a newly built one might not trigger it.

Currently the infofile dump is spread across both the executor and queue classes. Put it all into the executor to make it more manageable. Move all the final dump operations into the cleanup function, because the collect_result step might fail during Git operations; the cleanup function guarantees that they run in a `finally` scope.
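
A small sketch of why the final dump belongs in cleanup. `ExecutorSketch`, its status strings, and the simulated failure are hypothetical; the point is only that a dump placed in a `finally` path still runs when result collection raises.

```python
import json
from pathlib import Path


class ExecutorSketch:
    def __init__(self, infofile: Path) -> None:
        self.infofile = infofile
        self.status = "running"

    def collect_result(self, fail: bool) -> None:
        # Stand-in for the Git fetch that may raise, e.g. on a ref conflict.
        if fail:
            raise RuntimeError("ref conflict")
        self.status = "success"

    def cleanup(self) -> None:
        # The final infofile dump lives here so it runs on every exit path.
        if self.status == "running":
            self.status = "failed"
        self.infofile.write_text(json.dumps({"status": self.status}))


def run(executor: ExecutorSketch, fail: bool) -> None:
    try:
        executor.collect_result(fail)
    finally:
        executor.cleanup()
```

With this shape, a failed collection still leaves a terminal status on disk instead of a stale "running" entry.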

mattseddon · 2022-09-19T23:59:50Z

Could you please try it again

I will test today.

mattseddon · 2022-09-20T05:44:38Z

@karajan1001 I'm still seeing the same behaviour. Even with a fresh clone of https://github.com/iterative/vscode-dvc:

dvc exp show --show-json -v
2022-09-20 15:36:25,649 ERROR: unexpected error - Invalid revision: b'1dbb61e08d6d96c6db910a6d392cb1a1bdb9d04d'
------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/scmrepo/git/backend/dulwich/__init__.py", line 652, in fetch_refspecs
    check_diverged(
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/dulwich/porcelain.py", line 347, in check_diverged
    raise DivergedBranches(current_sha, new_sha)
dulwich.porcelain.DivergedBranches: b'2e1b8fbd00600a7457fb91fa14fa7d248a73913b'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/scmrepo/git/backend/dulwich/__init__.py", line 746, in diff
    commit_b = self.repo[os.fsencode(rev_b)]
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/dulwich/repo.py", line 787, in __getitem__
    return self.object_store[self.refs[name]]
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/dulwich/refs.py", line 320, in __getitem__
    raise KeyError(name)
KeyError: b'1dbb61e08d6d96c6db910a6d392cb1a1bdb9d04d'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/dvc/cli/__init__.py", line 185, in main
    ret = cmd.do_run()
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/dvc/cli/command.py", line 22, in do_run
    return self.run()
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/dvc/commands/experiments/show.py", line 475, in run
    all_experiments = self.repo.experiments.show(
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/dvc/repo/experiments/__init__.py", line 516, in show
    return show(self.repo, *args, **kwargs)
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/dvc/repo/experiments/show.py", line 163, in show
    running = repo.experiments.get_running_exps(fetch_refs=fetch_running)
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/dvc/repo/experiments/__init__.py", line 443, in get_running_exps
    self._fetch_running_exp(
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/dvc/repo/experiments/__init__.py", line 482, in _fetch_running_exp
    for ref in executor.fetch_exps(
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/dvc/repo/experiments/executor/base.py", line 363, in fetch_exps
    dest_scm.fetch_refspecs(
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/scmrepo/git/__init__.py", line 289, in _backend_func
    result = func(*args, **kwargs)
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/scmrepo/git/backend/dulwich/__init__.py", line 658, in fetch_refspecs
    on_diverged(
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/dvc/repo/experiments/executor/base.py", line 354, in on_diverged_ref
    self._raise_ref_conflict(
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/dvc/repo/experiments/executor/base.py", line 739, in _raise_ref_conflict
    if scm.diff(orig_rev, new_rev):
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/scmrepo/git/__init__.py", line 289, in _backend_func
    result = func(*args, **kwargs)
  File "/Users/mattseddon/projects/vc-nc1/demo/.env/lib/python3.9/site-packages/scmrepo/git/backend/dulwich/__init__.py", line 748, in diff
    raise RevError("Invalid revision") from exc
scmrepo.exceptions.RevError: Invalid revision
------------------------------------------------------------
2022-09-20 15:36:25,725 DEBUG: Removing '/Users/mattseddon/projects/vc-nc1/.RzXRu93vW73iJFXLaqEGG2.tmp'
2022-09-20 15:36:25,725 DEBUG: Removing '/Users/mattseddon/projects/vc-nc1/.RzXRu93vW73iJFXLaqEGG2.tmp'
2022-09-20 15:36:25,725 DEBUG: Removing '/Users/mattseddon/projects/vc-nc1/.RzXRu93vW73iJFXLaqEGG2.tmp'
2022-09-20 15:36:25,725 DEBUG: Removing '/Users/mattseddon/projects/vc-nc1/demo/.dvc/cache/.SR4jZgarDBcjiLMyqZnbZg.tmp'
2022-09-20 15:36:25,726 DEBUG: Version info for developers:
DVC version: 1.0.2.dev2371+g94c458d3 
---------------------------------
Platform: Python 3.9.13 on macOS-12.6-arm64-arm-64bit
Subprojects:
        dvc_data = 0.10.0
        dvc_objects = 0.4.0
        dvc_render = 0.0.11
        dvc_task = 0.1.2
        dvclive = 0.10.0
        scmrepo = 0.1.1
Supports:
        http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: https
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc (subdir), git

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2022-09-20 15:36:25,727 DEBUG: Analytics is enabled.
2022-09-20 15:36:25,758 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/var/folders/sb/fqcw44jd19nfrhl_9lz_81d80000gn/T/tmp04tcned9']'
2022-09-20 15:36:25,760 DEBUG: Spawned '['daemon', '-q', 'analytics', '/var/folders/sb/fqcw44jd19nfrhl_9lz_81d80000gn/T/tmp04tcned9']'

karajan1001 · 2022-09-21T07:35:58Z

Sorry, I cannot reproduce this error.

I also have another question: will this PR, which changes the format of the JSON output of exp show, affect the current version of the VS Code extension?

BTW, I found that the current dvc exp show runs really slowly. With the help of cProfile we can see that the three major time costs in dvc exp show are the SCM revision description, the stage collection, and the table rendering.
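
For anyone who wants to repeat that kind of measurement, here is a minimal sketch using the standard-library `cProfile` and `pstats` modules. `fake_exp_show` is a placeholder workload, not the real command, and `profile_call` is a hypothetical helper.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top: int = 10, **kwargs) -> str:
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args, **kwargs)
    finally:
        profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return stream.getvalue()


def fake_exp_show() -> int:
    # Placeholder workload standing in for the real command.
    return sum(i * i for i in range(50_000))
```

Printing `profile_call(fake_exp_show)` lists the most expensive calls first, which is how hot spots like the three named above would show up.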

mattseddon · 2022-09-21T11:04:35Z

Sorry I can not reproduce this error.

Can you try with watch -n0 dvc exp show --show-json? I can reproduce with just two terminals:

Screen.Recording.2022-09-21.at.8.54.12.pm.mov

Screen.Recording.2022-09-21.at.9.01.22.pm.mov

karajan1001 · 2022-09-22T06:53:06Z

Sorry I can not reproduce this error.

Can you try with watch -n0 dvc exp show --show-json? I can reproduce with just two terminals:

Screen.Recording.2022-09-21.at.8.54.12.pm.mov
Screen.Recording.2022-09-21.at.9.01.22.pm.mov

@mattseddon
Now I understand. Previously I believed the error was a lasting one, but with the watch command I can see that it occurs in an intermediate state. Testing on my local computer, I found that the bug existed before this PR, and that this PR solves the status out-of-sync problem in exp show.

We should open separate issues for the other problems in exp show. What I have found so far includes:

invalid ref during data collection.
task status turning to queued for 1 second before turning into success.
invalid ref during exp remove.

dberenbaum · 2022-09-22T13:30:17Z

@mattseddon Are these issues blockers for you? If you are getting intermittent errors, is it possible to ignore those?

mattseddon · 2022-09-22T23:47:27Z

@mattseddon Are these issues blockers for you? If you are getting intermittent errors, is it possible to ignore those?

This is a blocker. We cannot reliably ignore these errors as we cannot distinguish them from any other error type.

dberenbaum · 2022-09-23T13:58:34Z

@mattseddon Are these issues blockers for you? If you are getting intermittent errors, is it possible to ignore those?

This is a blocker. We cannot reliably ignore these errors as we cannot distinguish them from any other error type.

Thanks @mattseddon.

A few follow up questions:

Should it block the current PR? I see the VS Code table disappearing for a bit both before and after this PR. After this PR, I at least see much quicker updates to the table when the queue is started. Do you see the same? Are any of these errors new to this PR? I'm wondering if we can merge and work on the issues mentioned by @karajan1001 as follow ups.
Will the table disappear anytime there is an error returned by exp show? It seems like a strong assumption for a command that is constantly running in the background. For example, why not raise an error dialog but keep the last version of the table visible until the error is resolved? Or wait some number of iterations/amount of time before showing the error?
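
To make the second suggestion concrete, here is a sketch of a "keep the last good table, surface the error only after repeated failures" poller. It is written in Python purely for illustration and is not the extension's code; `TableCache`, `fetch`, and `max_failures` are hypothetical names.

```python
from typing import Callable, Optional


class TableCache:
    """Keeps the last good result and only surfaces an error after repeated failures."""

    def __init__(self, fetch: Callable[[], dict], max_failures: int = 3) -> None:
        self.fetch = fetch
        self.max_failures = max_failures
        self.last_good: Optional[dict] = None
        self.failures = 0

    def poll(self) -> Optional[dict]:
        try:
            self.last_good = self.fetch()
            self.failures = 0
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                # Too many consecutive failures: stop hiding the error.
                raise
        # On a tolerated failure this is the stale, last known table.
        return self.last_good
```

The trade-off is that a stale table can be shown for up to `max_failures - 1` polls, which is exactly the behaviour being debated here.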

mattseddon · 2022-09-25T22:36:58Z

Should it block the current PR? I see the VS Code table disappearing for a bit both before and after this PR. After this PR, I at least see much quicker updates to the table when the queue is started. Do you see the same? Are any of these errors new to this PR? I'm wondering if we can merge and work on the issues mentioned by @karajan1001 as follow ups.

Doesn't need to block this PR.

Will the table disappear anytime there is an error returned by exp show? It seems like a strong assumption for a command that is constantly running in the background. For example, why not raise an error dialog but keep the last version of the table visible until the error is resolved? Or wait some number of iterations/amount of time before showing the error?

Yes, it will disappear for any error. I would like to move away from the papering over the cracks approach that we have taken up until now.

dberenbaum · 2022-09-26T19:52:46Z

Yes, it will disappear for any error. I would like to move away from the papering over the cracks approach that we have taken up until now.

Agreed, but I'd consider these separate issues. Regardless of how stable the commands become, it still seems severe to me to have the table disappear whenever an unknown error occurs. I would almost always prefer it to be stale than have it disappear. Is there a reason why dropping the table is considered preferable?

karajan1001 · 2022-09-27T03:07:13Z

I gathered some other problems from my experience using exp show:

exp show is slow in a repo with a large number of checkpoints (it looks related to the collection of every single checkpoint).
The initialization of a temp workspace is slow (in Matt's demo it usually takes about half a minute on my computer).
During the initialization above we can't kill the queue tasks, because there is no infofile at that stage.

karajan1001 added the enhancement, A: experiments, and A: executors labels Aug 22, 2022

karajan1001 requested a review from pmrowla August 22, 2022 08:46

karajan1001 self-assigned this Aug 22, 2022

karajan1001 requested a review from a team as a code owner August 22, 2022 08:46

karajan1001 marked this pull request as draft August 22, 2022 08:46

karajan1001 force-pushed the fix8088 branch 2 times, most recently from 68b0d92 to 4854874 Compare August 23, 2022 09:18

karajan1001 changed the title ~~[WIP] exp show: sync state between queue and exp show table~~ exp show: sync state between queue and exp show table Aug 23, 2022

karajan1001 marked this pull request as ready for review August 23, 2022 10:11

karajan1001 force-pushed the fix8088 branch from 6ec7208 to 873cfa4 Compare August 23, 2022 13:09

pmrowla suggested changes Aug 24, 2022

pmrowla mentioned this pull request Aug 25, 2022

queue: edits to new docs iterative/dvc.org#3894

Merged

pmrowla reviewed Sep 8, 2022

karajan1001 force-pushed the fix8088 branch 2 times, most recently from afef91f to 47f4277 Compare September 9, 2022 08:00

karajan1001 requested a review from pmrowla September 9, 2022 08:00

pmrowla approved these changes Sep 13, 2022

karajan1001 force-pushed the fix8088 branch from bfa68bc to b4beb4e Compare September 14, 2022 04:37

karajan1001 added 3 commits September 19, 2022 09:34

Refactor seperate the initialization of executor and setup environment

e30f915

Reason: 1. Need more granular control over the initialization progress of the executor. What was done: 1. Separate the `init_git` and `init_cache` steps out of setup_executor 2. Move `set_ref` into `init_git`

Add a new attribute status to ExecutorInfo file

2493466

1. Add a new attribute status to ExecutorInfo file 2. Update running status to the executor infofile.

Use task status to replace collected.

b9916aa

fix: iterative#8088 1. Use status to replace collected. 2. Make status show more accurate. Fix lint

karajan1001 added 3 commits September 19, 2022 09:34

Add tests for executor status

7f1cabe

1. Move some basic test script from function tests to unit test. 2. Add success/failed tests for the status change of `tempdir`, `celery`, `workspace` running case.

Remove the stale manager code

adcfd56

Make executor info compatible with old versions.

b61c065

karajan1001 force-pushed the fix8088 branch from e0d67ef to 2c506b0 Compare September 19, 2022 01:34

karajan1001 force-pushed the fix8088 branch from 2c506b0 to 94c458d Compare September 19, 2022 01:43

This was referenced Sep 23, 2022

invalid ref during the data collection. #8348

Closed

tasks status turned to queued for 1 second before turned into success. #8349

Closed

Make exp show handle errors better #8350

Closed

Merge branch 'main' into fix8088

373135a

karajan1001 merged commit 6bd14b2 into iterative:main Sep 26, 2022

karajan1001 deleted the fix8088 branch September 26, 2022 08:25

karajan1001 mentioned this pull request Nov 3, 2022

exp show: lockless issues #7693

Closed
