Fix some celery queue related CI failures. #8404
Conversation
@karajan1001 looks like there are CI failures with these test changes.
dvc/repo/experiments/queue/celery.py
Outdated
except FileNotFoundError:
    pass
What are the chances of this file never being created? Worried about infinite loop here.
The time spent here is waiting for the data transfer to finish, but we do not know how long that will take.
One solution is to give a warning and exit if it hasn't finished after 5 or 10 seconds.
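A minimal sketch of that "warn and exit after a few seconds" idea (function name, timeout value, and the is_done callback are illustrative, not from the PR):

import logging
import time

logger = logging.getLogger(__name__)

def wait_for_transfer(is_done, timeout: float = 10.0, interval: float = 0.5) -> bool:
    # Poll is_done() until it returns True or `timeout` seconds elapse.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_done():
            return True
        time.sleep(interval)
    logger.warning("Data transfer did not finish within %.0f seconds; giving up.", timeout)
    return False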
@karajan1001, try changing the following lines (dvc/.github/workflows/tests.yaml, lines 48 to 50 in 8640beb) to:

pytest-filter:
  - "test_queue or experiment or exp"
  - "test_queue or experiment or exp"

That will run 20 (2x10) jobs. If you need more jobs, add more lines to pytest-filter.
Force-pushed from 03a80b4 to 8f02c4b (compare).
Force-pushed from b525749 to fe53d7a (compare).
dvc/repo/experiments/queue/celery.py
Outdated
MAX_RETRY = 5
for _ in range(MAX_RETRY):
    for _, queue_entry in self._iter_done_tasks():
        if queue_entry == entry:
            logger.debug("entry %s finished", entry.stash_rev)
            return
    time.sleep(1)
logger.warning(
    "Post process experiment %s time out with max retries %d.",
    entry.stash_rev,
    MAX_RETRY,
)
This doesn't belong in follow(); it will break the use case where the user is running queue logs -f and hits Ctrl-C to stop viewing the logs (it should exit without waiting for the underlying task to finish).
If there are places that use follow() but actually need to wait for the entire task to finish, we should really be doing something like:

celery_queue.follow(entry)
celery_queue.get_result(entry)

get_result() has better logic for waiting until the given entry is completed; we should avoid this kind of busy-wait sleep() whenever possible.
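As a brief sketch of that division of responsibility (assuming the celery_queue and entry objects from this discussion):

celery_queue.follow(entry)               # streams logs; Ctrl-C stops following without cancelling the task
result = celery_queue.get_result(entry)  # blocks until the entry has actually completed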
Yes, the problem is that get_result() is leaky; we might get the result before the task is complete.
dvc/dvc/repo/experiments/queue/celery.py
Lines 255 to 265 in c01583f
def _load_collected(rev: str) -> Optional[ExecutorResult]:
    executor_info = _load_info(rev)
    if executor_info.status > TaskStatus.SUCCESS:
        return executor_info.result
    raise FileNotFoundError

try:
    return _load_collected(entry.stash_rev)
except FileNotFoundError:
    # Infofile will not be created until execution begins
    pass
Here we look into the result directly without checking AsyncResult.ready() or waiting until AsyncResult.get() returns, and this is where the problem is.
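A hedged sketch of that direction (not the actual dvc implementation; the helper name and the way the task id is obtained are assumptions):

from celery.result import AsyncResult

def _maybe_load_result(app, task_id, rev):
    # Gate the infofile lookup on the Celery task state so we never read a
    # partially written result before the task has finished.
    async_result = AsyncResult(task_id, app=app)
    if not async_result.ready():
        return None  # task (and its data transfer) is still running
    async_result.get(propagate=False)  # make sure the task has fully completed
    return _load_collected(rev)  # _load_collected as defined in the snippet above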
Codecov Report
Base: 94.31% // Head: 93.98% // Decreases project coverage by -0.33%.
Additional details and impacted files

@@            Coverage Diff            @@
##             main    #8404     +/-  ##
=========================================
- Coverage   94.31%   93.98%   -0.33%
=========================================
  Files         430      430
  Lines       32840    32839       -1
  Branches     4592     4587       -5
=========================================
- Hits        30972    30864     -108
- Misses       1448     1538      +90
- Partials      420      437      +17
Force-pushed from 2278294 to 3dbf052 (compare).
@skshetry celery tests in
The error does look legit. Why is it trying to remove
@karajan1001 are they flaky (and sometimes pass) in 3.11 or do they always fail? Either way it is probably ok to just mark the celery tests with:
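(The exact mark was not preserved in the thread; a hypothetical illustration of such a mark could be:)

import sys
import pytest

# Hypothetical mark, for illustration only; not the snippet from the original comment.
celery_py311 = pytest.mark.xfail(
    sys.version_info >= (3, 11),
    reason="celery queue tests are flaky on Python 3.11",
    strict=False,
)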
We may also need to consider disabling the queue-related commands (or at least outputting a warning) if we are on 3.11, but that can be addressed in a separate issue (similar to how we don't support hydra functionality in 3.11).
They are already failing in main, so we can ignore them here. At a quick glance the issue seems unrelated to celery, so there's no need to xfail/skip it; celery seems to be working fine on 3.11 (given it's pure Python). The failure looks to be our fault, and we need to investigate it separately.
Looking into it, it always passes for me on Windows (and,
They always fail on Windows.
Let's track them in a separate issue.
@karajan1001, can you remove the changes to pytest-filter in the GitHub workflow? After @pmrowla approves, we can merge this.
Don't forget to also remove the @pytest.mark.parametrize("repeat", range(10)) usage in addition to the pytest-filter changes before merging.
fix: iterative#8403 1. Remove some of the flaky marks. 2. In `get_result`, make sure the celery task is completed.
1. Modify `run all` to include currently running exps. 2. Bump dvc-task to 0.1.5.
The failure looks similar to python/cpython#97641; however, I am not able to reproduce it locally.
There seems to be a regression in Python 3.11, where the sqlite connections are not deallocated due to internal changes in Python 3.11, which now uses an LRU cache. They are not deallocated until `gc.collect()` is called. See python/cpython#97641. This affects only Windows, because when we try to remove the tempdir for the exp run, the sqlite connection is open, which prevents us from deleting that folder. Although this may happen in a real `exp run` scenario, I am only fixing the tests by mocking `dvc.close()` and extending it to call `gc.collect()` afterwards. We could also mock `State.close()`, but I did not want to mock something that is not in dvc itself. `diskcache` uses thread-local connections, so they are expected to be garbage collected, and therefore it does not provide a good way to close the connections. The only API it offers is `self.close()`, and that only closes the main thread's connection. If we had access to the connection, an easier way would have been to explicitly call `conn.close()`, but we don't have such an option at the moment. Related: iterative#8404 (comment) GHA Failure: https://github.com/iterative/dvc/actions/runs/3437324559/jobs/5731929385#step:5:57
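A minimal sketch of that test-only workaround, assuming pytest-mock's mocker fixture and a dvc repo fixture (the test name and trailing assertions are hypothetical):

import gc

def test_exp_run_removes_tempdir(dvc, mocker):  # hypothetical test; `dvc` and `mocker` fixtures assumed
    original_close = dvc.close

    def close_and_collect():
        original_close()
        # Force deallocation of the sqlite connections held by diskcache on
        # Python 3.11 (python/cpython#97641) so the exp tempdir can be
        # removed on Windows.
        gc.collect()

    mocker.patch.object(dvc, "close", side_effect=close_and_collect)
    # ... run the experiment and assert that the temporary directory is gone ...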
Looks like I cannot force-merge it.
I have a fix here, #8547, that fixes the test.
wait for #8349
fix: #8403
I have followed the Contributing to DVC checklist.
If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible.