-
Notifications
You must be signed in to change notification settings - Fork 1.2k
exp show: sync state between queue and exp show table #8158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
68b0d92
to
4854874
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, executor.manager
should not be modified in this PR. We don't use it at all (it has been replaced by the queue classes). The code was left in since there is still some legacy dvc-machine/SSH executor related behavior that has not been moved into dvc-machine/SSH queues.
But at this point it's probably better for us to just remove it since it is currently unused, and then restore it once we get back to dvc-machine development.
(but removing it should be done in a separate PR)
dvc/repo/experiments/queue/celery.py
Outdated
@@ -248,7 +253,8 @@ def _load_info(rev: str) -> ExecutorInfo: | |||
|
|||
def _load_collected(rev: str) -> Optional[ExecutorResult]: | |||
executor_info = _load_info(rev) | |||
if executor_info.collected: | |||
print("executor_info is", executor_info.status) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra debugging statement
scm.set_ref(EXEC_HEAD, entry.head_rev) | ||
scm.set_ref(EXEC_MERGE, stash_rev) | ||
scm.set_ref(EXEC_BASELINE, entry.baseline_rev) | ||
refspec = f"{EXEC_NAMESPACE}/" | ||
push_refspec(scm, self.git_url, refspec, refspec) | ||
finally: | ||
for ref in (EXEC_HEAD, EXEC_MERGE, EXEC_BASELINE): | ||
if scm.get_ref(ref): | ||
scm.remove_ref(ref) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not be duplicating this try/finally code across every executor type. It should either stay in queue.base
(before we init an executor), or executor classes should be re-using some code from BaseExecutor
.
The reason it was outside of executors before (in manager
and then queue.base
), is that ideally we should be setting and removing those refs at the same level. And whether or not we want to remove them depends on when and why we are generating any executor at all, it is not dependent on specific executor types.
Basically, whether or not to clear the refs after init'ing an executor depends on whether or not DVC needs to clean the entire workspace or not. In practice, this means whether we are doing a workspace run or any other type of run (tempdir, remote machine, etc). But the actual exectuors themselves will function whether or not the refs are cleared after calling init_git
. So the caller should be responsible for setting and clearing those refs, not the executor itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, whether or not to clear the refs after init'ing an executor depends on whether or not DVC needs to clean the entire workspace or not.
Hi, @pmrowla thanks for the suggestions but this is not about cleaning the executor references. but cleaning the workspace reference during initialization. The only place where we do not need to clean these references is in the workspace because we need to use these EXEC
references later in running.
The reason it was outside of executors before (in manager and then queue.base), is that ideally we should be setting and removing those refs at the same level.
This is the reason why I brought them to this place, the reference set is now in the queue
but the clean progress is inside the executor's init. And we had already forgotten to clean it in previous #8043. This is why I now put them all here on the same level.
scm.set_ref(EXEC_HEAD, entry.head_rev) | ||
scm.set_ref(EXEC_MERGE, stash_rev) | ||
scm.set_ref(EXEC_BASELINE, entry.baseline_rev) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@karajan1001 this is still a problem where we have a lot of duplicated code (in all of the different executor init methods). If you want to keep this in the executor classes we can do that, but the common code should be refactored so that it goes in the base executor class.
Maybe something like this in the base class
@contextmanager
def set_exec_refs(self):
# init refs
scm.set_ref(...)
try:
yield
finally:
# cleanup refs
self.scm.remove_ref(...)
and then within each executor init_git you would have
def init_git():
with self.init_exec_refs():
# existing init_git code()
...
# any additional executor specific cleanup handling here
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only one exception in the WorkspaceExecutor
, because it will remove these references after the task has been completed.
afef91f
to
47f4277
Compare
Looks like need to clean the manager or the pylint will fail. |
bfa68bc
to
b4beb4e
Compare
Excuse me, @mattseddon, Could you please help me to verify if the problem is solved after this PR on the VSCode extension? Thank you. |
I will test 👍🏻. |
This is the experience in the extension using queue-start-j-1.movIt is an issue that
Question: Is Note: Probably unrelated to this change but I started by trying to run with I went through the following steps:
After those failures, any attempt to queue an experiment would result in the experiment being run straight away. Even though
The only way that I could get the repo out of this state was to delete |
For the
|
Verbose log for error:
|
Reason: 1. Need some more granular control over initialization progress of executor. What done: 1. Seperate `init_git` and `init_cache` progress out from setup_executor 2. Move `set_ref` into `init_git`
1. Add a new attribute status to ExecutorInfo file 2. Update running status to the executor infofile.
fix: iterative#8088 1. Use status to replace collected. 2. Make status show more accurate. Fix lint
1. Move some basic test script from function tests to unit test. 2. Add success/failed tests for the status change of `tempdir`, `celery`, `workspace` running case.
e0d67ef
to
2c506b0
Compare
The previous problem was because the collect result progress failed in the git Could you please try it again,(I guess your local repo might be polluted in the previous test, and might need to clean the result manually, building a completely new workspace might help, but the previous error was caused in a dirty env, the newly built one might not trigger the previous error)? |
Currently infofile dump was spread in both executor and queue classes. put them all into executor to make them more managable. Move all ending dump operations to the cleanup function, because the collect_result progress might failed because of git operations, in which can guarantee them to be run in a `finally` scope.
2c506b0
to
94c458d
Compare
I will test today. |
@karajan1001 I'm still seeing the same behaviour. Even with a fresh clone of https://github.com/iterative/vscode-dvc:
|
Sorry I can not reproduce this error. And I had another question, does this PR (changed the format of JSON output of BTW, I found that the current |
Can you try with Screen.Recording.2022-09-21.at.8.54.12.pm.movScreen.Recording.2022-09-21.at.9.01.22.pm.mov |
@mattseddon we should open some other issues for the problems during the
|
@mattseddon Are these issues blockers for you? If you are getting intermittent errors, is it possible to ignore those? |
This is a blocker. We cannot reliably ignore these errors as we cannot distinguish them from any other error type. |
Thanks @mattseddon. A few follow up questions:
|
Doesn't need to block this PR.
Yes, it will disappear for any error. I would like to move away from the papering over the cracks approach that we have taken up until now. |
Agreed, but I'd consider these separate issues. Regardless of how stable the commands become, it still seems severe to me to have the table disappear in case an unknown error ever occurs. I would almost always prefer it to be stale than have it disappear. Is there a reason to dropping the table is considered preferable? |
I gathered some of the other problems during my experience using
|
fix: #8088
executor.init_git
tempdir
,celery
,workspace
running case.❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible. 🙏