gh-90155: Fix broken asyncio.Semaphore and strengthen FIFO guarantee. #93222

Merged: 14 commits into python:main on Sep 22, 2022

Conversation

cykerway
Contributor

gh-90155: Fix broken :class:asyncio.Semaphore and strengthen FIFO guarantee.

The current asyncio.Semaphore may become broken under certain workflows. Tasks waiting on a broken Semaphore can hang forever. This PR not only fixes this problem but also strengthens the FIFO guarantee on Semaphore waiters. The test cases show the details.

@cykerway cykerway requested review from 1st1 and asvetlov as code owners May 25, 2022 16:34
@cykerway cykerway changed the title Fix broken :class:asyncio.Semaphore and strengthen FIFO guarantee. [3.12] GH-90155 Fix broken :class:asyncio.Semaphore and strengthen FIFO guarantee. May 25, 2022
@cykerway cykerway changed the title [3.12] GH-90155 Fix broken :class:asyncio.Semaphore and strengthen FIFO guarantee. [3.12] GH-90155: Fix broken :class:asyncio.Semaphore and strengthen FIFO guarantee. May 25, 2022
@cykerway cykerway changed the title [3.12] GH-90155: Fix broken :class:asyncio.Semaphore and strengthen FIFO guarantee. gh-90155: Fix broken asyncio.Semaphore and strengthen FIFO guarantee. May 26, 2022
@mguentner

@cykerway Thank you for your work on this. I can confirm that this is indeed an issue in the wild.

I spent quite some time debugging an application that was blocking in rare cases after cancelled tasks were seen. Finally I checked the internal state of the Semaphore and saw that _wakeup_scheduled was True with no _waiters left to set it back to False, which blocks it forever.

Before I found this PR, I came up with another solution which I have documented here together with a program that reproduces the race condition (inspired by the tests added by this PR).
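
For reference, here is a minimal sketch of the kind of unluckily timed cancellation being described (this is not the reproducer from the gist above and not part of this PR's test suite): a waiter is woken by release() and then cancelled before it can run, which on the affected releases described above can leave the semaphore stuck even though it is free.

```python
import asyncio

async def main():
    sem = asyncio.Semaphore(1)

    await sem.acquire()                         # hold the only slot
    waiter = asyncio.create_task(sem.acquire())
    await asyncio.sleep(0)                      # let `waiter` block on its internal future
    sem.release()                               # schedules a wakeup for `waiter`...
    waiter.cancel()                             # ...but cancel it before it can resume
    await asyncio.sleep(0)                      # let the cancellation be processed

    # The semaphore is free again at this point, but on the affected releases
    # the lost wakeup can leave it stuck and this acquire() times out.
    try:
        await asyncio.wait_for(sem.acquire(), timeout=1)
        print("acquired")
    except asyncio.TimeoutError:
        print("semaphore is stuck")

asyncio.run(main())
```

On a fixed interpreter the final acquire() succeeds immediately, since the semaphore really is free at that point.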

Member

@gvanrossum gvanrossum left a comment

I'm not going to lie, I don't understand this code well enough to approve it. :-(

I wonder where we went wrong in asyncio's design that the simple version (which has long been fixed and fixed again) didn't work. :-(

@cykerway
Contributor Author

I wonder where we went wrong in asyncio's design that the simple version (which has long been fixed and fixed again) didn't work. :-(

(If I remember correctly)

  1. The original implementation, before #90155 (asyncio.Semaphore waiters deque doesn't work), was working, just not FIFO. But the docs don't say semaphores are fair or FIFO. People who don't need FIFO semaphores could use the original one.

  2. Someone was unsatisfied with task starvation and opened #90155 (asyncio.Semaphore waiters deque doesn't work). The changes introduced for it (such as 9d59381) were aimed at bringing fairness to semaphores, but the implementation was flawed and introduced a regression that makes the semaphore unusable under certain race conditions. This is what I showed in #90155 (comment).

  3. Those patches have been officially merged into 3.10 and other versions, so semaphores are currently broken. I mean the Python installed by system package managers here, not a nightly build. This includes the current version, 3.10.7.

  4. This PR is meant to be a hotfix for this problem. It doesn't set out to settle the FIFO vs. non-FIFO question, but tries to bring back a usable semaphore with minimal changes to the existing implementation. Another way is to just revert everything to the very beginning, but I think the advantage of taking this PR over reverting everything is that it prevents task starvation. In either case the tests added by this PR are still useful for preventing future regressions. You are welcome to add more tests to ensure this PR itself doesn't introduce new regressions.

Member

@gvanrossum gvanrossum left a comment

I approve of the new solution. I have some nits about the tests; I'm not too sure about those, so push back if you think I misunderstand how it works!

Member

@gvanrossum gvanrossum left a comment

I think I've cracked the sleep(0.01) mystery.

`Semaphore` docstring says the counter can never go below zero.
Member

@gvanrossum gvanrossum left a comment

There are several more sleep(0.01) cases that should be sleep(0), and I had a suggestion for the wait_for(..., timeout=0.01) too.

@bedevere-bot

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@kumaraditya303
Contributor

FYI, 3.9 is now in security-fixes-only mode, so most likely this cannot be backported to 3.9.

@cykerway
Contributor Author

Do you think that version would pass all the tests you added?

That version can pass all the tests so far, and may look better if you have the "obsession over minimizing the number of sleep(0) calls needed to get progress". But that version was made without knowledge of the loop internals. I didn't find a spec anywhere. For example, the documentation doesn't say what exactly happens when you cancel a task that is waiting on a future, and I don't know what can be depended on in that case. The new version is cleaner: waking up waiters one by one is the safer way to go. Should you really want to wake up more and process them in a batch, there had better be a spec for async loops so we know what we (and users) can rely on.
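
For illustration only, here is a minimal sketch of the "wake waiters one by one" shape being described (this is not the code in the PR; the class and the _wake_up_first helper are made up for the example): each release() completes at most one pending waiter future, and a waiter that is woken but then cancelled passes its wakeup on.

```python
import asyncio
import collections

class OneByOneSemaphore:
    """Illustrative shape only: each release() wakes at most one waiter."""

    def __init__(self, value=1):
        self._value = value
        self._waiters = collections.deque()

    def _wake_up_first(self):
        # Complete the first waiter future that is still pending.
        for fut in self._waiters:
            if not fut.done():
                fut.set_result(True)
                return

    def release(self):
        self._value += 1
        self._wake_up_first()

    async def acquire(self):
        if self._value > 0 and not self._waiters:
            self._value -= 1          # fast path: free slot, nobody queued
            return True
        fut = asyncio.get_running_loop().create_future()
        self._waiters.append(fut)     # FIFO queue of waiter futures
        try:
            await fut
        except asyncio.CancelledError:
            self._waiters.remove(fut)
            if fut.done() and not fut.cancelled():
                # We were woken but then cancelled: pass the wakeup on.
                self._wake_up_first()
            raise
        self._waiters.remove(fut)
        self._value -= 1
        return True
```

Because each release() hands out at most one wakeup, correctness doesn't depend on the order in which the loop later runs the woken tasks; the cost, discussed further below, is that at most one queued waiter makes progress per loop iteration.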

@gvanrossum
Member

Okay, we'll go with this version.

@gvanrossum gvanrossum merged commit 24e0379 into python:main Sep 22, 2022
@miss-islington
Contributor

Thanks @cykerway for the PR, and @gvanrossum for merging it 🌮🎉.. I'm working now to backport this PR to: 3.9, 3.10, 3.11.
🐍🍒⛏🤖

@bedevere-bot

GH-97019 is a backport of this pull request to the 3.11 branch.

@bedevere-bot bedevere-bot removed the needs backport to 3.10 only security fixes label Sep 22, 2022
@bedevere-bot

GH-97020 is a backport of this pull request to the 3.10 branch.

@miss-islington
Contributor

Sorry, @cykerway and @gvanrossum, I could not cleanly backport this to 3.9 due to a conflict.
Please backport using cherry_picker on command line.
cherry_picker 24e03796248ab8c7f62d715c28156abe2f1c0d20 3.9

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Sep 22, 2022
…antee (pythonGH-93222)

The main problem was that an unluckily timed task cancellation could cause
the semaphore to be stuck. There were also doubts about strict FIFO ordering
of tasks allowed to pass.

The Semaphore implementation was rewritten to be more similar to Lock.
Many tests for edge cases (including cancellation) were added.
(cherry picked from commit 24e0379)

Co-authored-by: Cyker Way <[email protected]>
@gvanrossum
Member

@cykerway Do you want to do the 3.9 backport by hand? We can also leave it be, 3.9 is in security-fix mode anyways.

miss-islington added a commit that referenced this pull request Sep 22, 2022
…H-93222)

The main problem was that an unluckily timed task cancellation could cause
the semaphore to be stuck. There were also doubts about strict FIFO ordering
of tasks allowed to pass.

The Semaphore implementation was rewritten to be more similar to Lock.
Many tests for edge cases (including cancellation) were added.
(cherry picked from commit 24e0379)

Co-authored-by: Cyker Way <[email protected]>
@cykerway
Contributor Author

Do you want to do the 3.9 backport by hand? We can also leave it be, 3.9 is in security-fix mode anyways.

No idea about that. I'm not very familiar with the project management, and I don't even know what the backport conflict is. Perhaps @kumaraditya303 knows that topic better.

@kumaraditya303
Contributor

3.9 is in security fixes only now, so better just leave it as it is, no need to backport.

@gvanrossum
Member

That's cool.

@gvanrossum gvanrossum removed the needs backport to 3.9 only security fixes label Sep 22, 2022
@gvanrossum
Member

gvanrossum commented Sep 25, 2022

Okay, so after merging this I still couldn't stop thinking about it, and I came up with a scenario where this is substantially worse (orders of magnitude) than before.

The test program spawns oodles of trivial tasks and tries to rate-limit them by making each task acquire a semaphore first. The semaphore allows 50 tasks at a time. With python 3.11rc2, on my Mac it does nearly 25,000 iterations per second. With the new code it does about 900, or about 27 times slower.

The first 50 tasks take the easy path through acquire(), every following task gets put in the queue first. What makes it so slow is that once 50 tasks have acquired the semaphore in a single event loop iteration, they will also release it all in a single iteration -- but the new algorithm only wakes up the first waiting task, and the next 49 release() calls do nothing to the queue. So from then on we only wake up one task per iteration.

Of course, it's easy to complain when the broken code is fast. :-) But I think we can do better.

(UPDATE: The same code without using a semaphore can do over 200,000 loops/sec. Same if I keep the semaphore but set the initial value to 50,000. If I also remove the sleep(0) from the task it can do over 400,000 loops/sec. A loop that doesn't create tasks but calls sleep(0) and bumps the counter runs around 50,000.)
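
For context, a rough stand-in for the kind of benchmark described above (this is not the actual gist; the batch size, duration, and counting scheme here are arbitrary choices): oodles of trivial tasks are rate-limited through a Semaphore(50) and the loop reports iterations per second.

```python
import asyncio
import time

async def main():
    sem = asyncio.Semaphore(50)          # rate limit: 50 tasks at a time
    count = 0

    async def work():
        nonlocal count
        async with sem:
            await asyncio.sleep(0)       # the "trivial work": yield once
            count += 1

    start = time.perf_counter()
    while time.perf_counter() - start < 3.0:
        # Spawn a batch of trivial tasks and wait for all of them.
        await asyncio.gather(*(work() for _ in range(1000)))
    elapsed = time.perf_counter() - start
    print(f"{count / elapsed:.0f} iterations/sec")

asyncio.run(main())
```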

@cykerway
Contributor Author

cykerway commented Sep 25, 2022

Looks like the slowdown is caused by the context switches between tasks. I'm not familiar enough with the loop to estimate how much overhead there is when control is passed to the loop and then passed back. The new version implements FIFO order without depending on loop guarantees, and this is safer. It sacrifices performance because it is not aided by the loop.

There are several things here: correctness, performance, and FIFO. We definitely want correctness, and there is a tradeoff between performance and FIFO. If you try my first version, it runs almost as fast as the old one before this commit and it doesn't seem to have the starvation problem, but it's not obviously FIFO. To tackle this tradeoff there should be some explicit guarantee from the loop (about the order of execution of tasks waiting on futures when those futures are woken), yet I haven't seen one.

@gvanrossum
Member

From reading the code I know that the event loop absolutely calls callbacks that were scheduled using call_soon() in the order in which they were registered. I know that uvloop also uses this order. I feel that it's silly not to rely on this, since it gives us an important tool to fix the slowdown here. We should probably add a comment that states the dependency on this loop property.
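
A tiny check of that ordering property (relying only on the documented asyncio API; the guarantee itself is quoted from the docs a little further down):

```python
import asyncio

async def main():
    loop = asyncio.get_running_loop()
    order = []
    for i in range(5):
        loop.call_soon(order.append, i)   # register five callbacks
    await asyncio.sleep(0)                # yield so the queued callbacks run
    # call_soon() callbacks run in the order in which they were registered.
    assert order == [0, 1, 2, 3, 4]
    print(order)

asyncio.run(main())
```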

@kumaraditya303
Contributor

Can you create a new issue to discuss this? Your gist program looks like the worst-case scenario for this, as it does no I/O and just yields. Also, what is the throughput impact when you do some real work in the function?

@gvanrossum
Member

gvanrossum commented Sep 25, 2022

DO NOT FOLLOW UP HERE; DISCUSS AT #97545

Actually the behavior is guaranteed. Under Scheduling Callbacks I read

Callbacks are called in the order in which they are registered.

So we can definitely rely on this. (Sorry I hadn't mentioned this earlier, I wasn't actually aware of the guarantee, just of how the code works.)

Something that doesn't affect correctness but may affect performance is that the event loop goes through a cycle:

  • wait for I/O events (*)
  • register callbacks for I/O events
  • call all callbacks that are registered and ready at this point (I/O and otherwise)
  • go back to top

(*) The timeout for the I/O wait is zero if there are callbacks that are immediately ready, otherwise the time until the first callback scheduled for a particular time in the future (call_later()).

The I/O wait has expensive fixed overhead, so we want to call as many callbacks in a single iteration as possible.

Therefore I think it behooves us to make ready all futures for which we have room. I think it can work like the following abstract algorithm (a rough Python sketch follows the list):

  • Definitions:
    • Level: L = self._value
    • Waiters: W = self._waiters
    • Ready: R = [w for w in W if w.done() and not w.cancelled()]
    • Cancelled: C = [w for w in W if w.cancelled()]
    • Blocked: B = [w for w in W if not w.done()]
    • Note that R, C and B are views on W, not separate data structures
  • Invariant that should hold at all times:
    • L >= |R|
      (I.e., we should not promise more guests to seat than we have open tables)
  • Operations:
    • Equalize: while |B| > 0 and L >= |R|: make the first item of B ready (move it to R)
    • Release: L++; Equalize
    • Acquire:
      • if L > 0 and |R| == 0 and |B| == 0: L--; return
      • create a future F, append it to B, await it
      • when awaken (with or without exception):
        • assertion: F should be in either R or C
        • remove F from W (hence from R or C)
        • if no exception caught: L--; Equalize; return
        • if CancelledException caught and F.cancelled(): return
        • if CancelledException caught and not F.cancelled(): Equalize; return
        • (other exceptions are not expected and will bubble out unhandled)
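
For concreteness, here is a rough, self-contained Python sketch of the abstract algorithm above (purely illustrative, not the code that was merged and not necessarily what #97545 ended up with; the Equalize guard is taken as waking blocked waiters only while |R| < L so the stated invariant holds, and the cancelled-after-wakeup case re-raises CancelledError after passing the slot on, which is one possible reading of the last bullets):

```python
import asyncio
import collections

class EqualizeSemaphore:
    """Purely illustrative sketch of the Level / Ready / Blocked bookkeeping."""

    def __init__(self, value=1):
        self._value = value                    # L: the level
        self._waiters = collections.deque()    # W: waiter futures, FIFO

    def _equalize(self):
        # Wake blocked waiters (B) while there are fewer ready waiters (R)
        # than the level (L), so the invariant L >= |R| is preserved.
        ready = sum(1 for w in self._waiters
                    if w.done() and not w.cancelled())
        for w in self._waiters:
            if ready >= self._value:
                break
            if not w.done():
                w.set_result(True)             # move it from B to R
                ready += 1

    def release(self):
        self._value += 1                       # L++
        self._equalize()

    async def acquire(self):
        if self._value > 0 and not self._waiters:
            self._value -= 1                   # fast path: L > 0, no waiters
            return True
        fut = asyncio.get_running_loop().create_future()
        self._waiters.append(fut)              # join B (the blocked set)
        try:
            await fut
        except asyncio.CancelledError:
            self._waiters.remove(fut)
            if fut.done() and not fut.cancelled():
                # Woken, then cancelled: give the promised slot to someone else.
                self._equalize()
            raise
        self._waiters.remove(fut)
        self._value -= 1                       # L--
        self._equalize()
        return True
```

Compared with waking a single waiter per release(), _equalize() lets one event-loop iteration hand out as many wakeups as there is capacity for, which is what the loop-cycle discussion above is aiming at.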

DO NOT FOLLOW UP HERE; DISCUSS AT #97545

@gvanrossum gvanrossum mentioned this pull request Sep 25, 2022
@python python locked as resolved and limited conversation to collaborators Sep 25, 2022
@gvanrossum
Member

(It's not really resolved, but I needed to give a reason and that was less wrong than the other options GitHub gave me.)
