Fix some celery queue related CI failures. #8404
Conversation
@karajan1001 looks like there are CI failures with these test changes.
dvc/repo/experiments/queue/celery.py
Outdated
except FileNotFoundError:
    pass
What are the chances of this file never being created? Worried about infinite loop here.
The time spent here is waiting for the data transfer to finish, but we do not know how long that will take.
One solution is to give a warning and exit if it hasn't finished after 5 or 10 seconds.
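A minimal sketch of that "warn and exit after a few seconds" idea (function name, timeout value, and the is_done callback are illustrative, not from the PR):

import logging
import time

logger = logging.getLogger(__name__)

def wait_for_transfer(is_done, timeout: float = 10.0, interval: float = 0.5) -> bool:
    # Poll is_done() until it returns True or `timeout` seconds elapse.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_done():
            return True
        time.sleep(interval)
    logger.warning("Data transfer did not finish within %.0f seconds; giving up.", timeout)
    return False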
@karajan1001, try changing the following lines (dvc/.github/workflows/tests.yaml, lines 48 to 50 in 8640beb) to:

pytest-filter:
  - "test_queue or experiment or exp"
  - "test_queue or experiment or exp"

That will run 20 (2x10) jobs. If you need more jobs, add more lines to pytest-filter.
Force-pushed from 03a80b4 to 8f02c4b (compare).
Force-pushed from b525749 to fe53d7a (compare).
dvc/repo/experiments/queue/celery.py
Outdated
MAX_RETRY = 5
for _ in range(MAX_RETRY):
    for _, queue_entry in self._iter_done_tasks():
        if queue_entry == entry:
            logger.debug("entry %s finished", entry.stash_rev)
            return
    time.sleep(1)
logger.warning(
    "Post process experiment %s time out with max retries %d.",
    entry.stash_rev,
    MAX_RETRY,
)
This doesn't belong in follow(); it will break the use case where the user is running queue logs -f and hits Ctrl-C to stop viewing the logs (it should exit without waiting for the underlying task to finish).
If there are places that use follow() but actually need to wait for the entire task to finish, we should really be doing something like:

celery_queue.follow(entry)
celery_queue.get_result(entry)

get_result() has better logic for waiting until the given entry is completed; we should avoid this kind of busy-wait sleep() whenever possible.
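As a brief sketch of that division of responsibility (assuming the celery_queue and entry objects from this discussion):

celery_queue.follow(entry)               # streams logs; Ctrl-C stops following without cancelling the task
result = celery_queue.get_result(entry)  # blocks until the entry has actually completed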
Yes, the problem is that get_result() is leaky; we might get the result before the task is complete.
dvc/dvc/repo/experiments/queue/celery.py
Lines 255 to 265 in c01583f
def _load_collected(rev: str) -> Optional[ExecutorResult]:
    executor_info = _load_info(rev)
    if executor_info.status > TaskStatus.SUCCESS:
        return executor_info.result
    raise FileNotFoundError

try:
    return _load_collected(entry.stash_rev)
except FileNotFoundError:
    # Infofile will not be created until execution begins
    pass
Here we look into the result directly without checking AsyncResult.ready() or waiting until AsyncResult.get() returns, and this is where the problem is.
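A hedged sketch of that direction (not the actual dvc implementation; the helper name and the way the task id is obtained are assumptions):

from celery.result import AsyncResult

def _maybe_load_result(app, task_id, rev):
    # Gate the infofile lookup on the Celery task state so we never read a
    # partially written result before the task has finished.
    async_result = AsyncResult(task_id, app=app)
    if not async_result.ready():
        return None  # task (and its data transfer) is still running
    async_result.get(propagate=False)  # make sure the task has fully completed
    return _load_collected(rev)  # _load_collected as defined in the snippet above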
Codecov Report
Base: 94.31% // Head: 93.98% // Decreases project coverage by -0.33%.
Additional details and impacted files

@@            Coverage Diff            @@
##             main    #8404     +/-  ##
=========================================
- Coverage   94.31%   93.98%   -0.33%
=========================================
  Files         430      430
  Lines       32840    32839       -1
  Branches     4592     4587       -5
=========================================
- Hits        30972    30864     -108
- Misses       1448     1538      +90
- Partials      420      437      +17
Force-pushed from 2278294 to 3dbf052 (compare).
@skshetry celery tests in
The error does look legit. Why is it trying to remove
@karajan1001 are they flaky (and sometimes pass) in 3.11 or do they always fail? Either way it is probably ok to just mark the celery tests with:
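(The exact mark was not preserved in the thread; a hypothetical illustration of such a mark could be:)

import sys
import pytest

# Hypothetical mark, for illustration only; not the snippet from the original comment.
celery_py311 = pytest.mark.xfail(
    sys.version_info >= (3, 11),
    reason="celery queue tests are flaky on Python 3.11",
    strict=False,
)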
We may also need to consider disabling the queue-related commands (or at least outputting a warning) if we are on 3.11, but that can be addressed in a separate issue (similar to how we don't support hydra functionality in 3.11).
They are already failing in main, so we can ignore them here. At a quick glance the issue seems unrelated to celery, so there's no need to xfail/skip it; celery seems to be working fine on 3.11 (given it's pure Python). The failure looks to be our fault, and we need to investigate it separately.
Looking into it, it always passes for me on Windows (and,
They always fail on Windows.
Let's track them in a separate issue.
@karajan1001, can you remove the changes to pytest-filter in the GitHub workflow? After @pmrowla approves, we can merge this.
Don't forget to also remove the @pytest.mark.parametrize("repeat", range(10)) usage in addition to the pytest-filter changes before merging.
fix: iterative#8403 1. Remove some of the flaky marks. 2. In `get_result`, make sure the celery task is completed.
1. Modify `run all` to include currently running exps. 2. Bump dvc-task to 0.1.5.
The failure looks similar to python/cpython#97641; however, I am not able to reproduce it locally.
There seems to be a regression in Python 3.11, where the sqlite connections are not deallocated due to internal changes in Python 3.11, which now uses an LRU cache. They are not deallocated until `gc.collect()` is called. See python/cpython#97641. This affects only Windows, because when we try to remove the tempdir for the exp run, the sqlite connection is open, which prevents us from deleting that folder. Although this may happen in a real `exp run` scenario, I am only fixing the tests by mocking `dvc.close()` and extending it to call `gc.collect()` afterwards. We could also mock `State.close()`, but I did not want to mock something that is not in dvc itself. `diskcache` uses thread-local connections, so they are expected to be garbage collected, and therefore it does not provide a good way to close the connections. The only API it offers is `self.close()`, and that only closes the main thread's connection. If we had access to the connection, an easier way would have been to explicitly call `conn.close()`, but we don't have such an option at the moment. Related: iterative#8404 (comment) GHA Failure: https://github.com/iterative/dvc/actions/runs/3437324559/jobs/5731929385#step:5:57
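A minimal sketch of that test-only workaround, assuming pytest-mock's mocker fixture and a dvc repo fixture (the test name and trailing assertions are hypothetical):

import gc

def test_exp_run_removes_tempdir(dvc, mocker):  # hypothetical test; `dvc` and `mocker` fixtures assumed
    original_close = dvc.close

    def close_and_collect():
        original_close()
        # Force deallocation of the sqlite connections held by diskcache on
        # Python 3.11 (python/cpython#97641) so the exp tempdir can be
        # removed on Windows.
        gc.collect()

    mocker.patch.object(dvc, "close", side_effect=close_and_collect)
    # ... run the experiment and assert that the temporary directory is gone ...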
Looks like I cannot force-merge it.
I have a fix here, #8547, that fixes the test.
wait for #8349
fix: #8403
I have followed the Contributing to DVC checklist.
If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible.