Skip to content

Document whether asyncio.wait_for(q.get(), timeout) is safe of race-conditions. #92824

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Yaakov-Belch opened this issue May 15, 2022 · 12 comments
Labels
docs Documentation in the Doc dir topic-asyncio

Comments

@Yaakov-Belch
Copy link

Yaakov-Belch commented May 15, 2022

Documentation

Please add the following text (or some equivalent) to the documentation of asyncio.Queue:

The construct asyncio.wait_for(queue.get(), timeout) is guaranteed to be free of race conditions. It will not lose queue items.

Explanation: According to the public documentation, this construct is allowed to lose queue items: The documentation states that wait_for uses cancellation which is implemented by raising a CancelledError exception. Depending on the implementation, this may happen in the call queue.get() after an item has been removed from the queue. This would cause the extracted queue item to be lost.

According to the current documentation, this race condition is allowed: Such a race condition would be surprising for most programmers --- but it would not violate the letter of the documentation. Hence, judging only the documented features, a responsible programmer will hesitate to use this construct in production settings where losing queue items is not acceptable.

This issue has been reported and addressed at least twice:

Based on the comments in these two bug reports it seems to me that the current implementation avoids this race condition. However, this is not clearly spelled out as a commitment for future versions.

Conclusion
At the current state of affairs, a responsible programmer will find no assurance in the documentation of whether the construct asyncio.wait_for(queue.get(), timeout) can safely be used in production environments. He will need to spend much time (I spend more than two hours) to search for bug reports such as listed above and still be left with doubts about whether he can safely use the simple construct or still needs to implement a safe queue-with-timeout-in-get.

If the python implementation promises to avoid this race condition --- please state it clearly in the documentation.

@Yaakov-Belch Yaakov-Belch added the docs Documentation in the Doc dir label May 15, 2022
@kumaraditya303
Copy link
Contributor

FWIW as #90927 is implemented, asyncio.wait_for is likely to be deprecated in favor of asyncio.timeout.

@gvanrossum What do you think about deprecating asyncio.wait_for?

@Yaakov-Belch
Copy link
Author

@kumaraditya303 -- Thank you very much for contributing to this discussion.

If you want, we can formulate my documentation request more generally and write:

The queue.get() API is cancellation-safe: It guarantees not to remove any items from the queue (or return to the queue any removed items) in case that this call is interrupted by an asyncio.CancelledError.

This would cover any cancellation-based timeout implementation. Of course, it's important that this promise is actually kept.

In fact, I feel quite uneasy about using exception-based mechanisms as the only way to achieve a timeout at an elementary operation such as queue.get(). My strong preference would be to have some queue.get() API that provides a timeout without using cancellations or exceptions.

@gvanrossum
Copy link
Member

FWIW as #90927 is implemented, asyncio.wait_for is likely to be deprecated in favor of asyncio.timeout.

@gvanrossum What do you think about deprecating asyncio.wait_for?

I think it's way too early to think about that. I would think there's a lot of code using wait_for() that works just fine and we shouldn't force them to to change their code by starting to deprecate.

What we could do would be to add something to the docs for wait_for() that shows how to write the equivalent using timeout(). But for that to make sense, we first need to add the docs for timeout(). :-)

I haven't quite grasped the key topic of this issue yet, will respond about that later.

@gvanrossum
Copy link
Member

gvanrossum commented May 24, 2022

Oops, didn't mean to close. (Seems I discovered a new GitHub keyboard shortcut. :-)

@Yaakov-Belch
Copy link
Author

@gvanrossum -- Let me explain the reason why I opened this ticket: Confidently writing correct code. With the current documentation of the existing functionality, I cannot be confident that my code does what I want -- and it is very difficult to work around that. Let me explain with an example:

async with timeout_at(evening):
    while True:
        task = await task_queue.get()
        if await is_paying_customer(task.user):
            await priority_queue.put(task)
        else:
            await standard_queue.put(task)

Let's ignore for the time being the timeout context. Without it, it is quite easy to keep some promises about the outcome of the code:

  • Every task that is submitted to the task_queue will either stay in this queue or end up in exactly one of: priority_queue or standard_queue.
  • For every task that is routed, the function is_paying_customer(user) is called exactly once. It can be used to collect statistics about how many tasks a user submitted.

Now, the timeout manager adds another promise (the program will complete at evening) but it breaks the above promises that are important to me. This means: My code is not cancellation-safe. More than that: even if my code would be cancellation-safe, I would also depend on whether the function is_paying_customer() is cancellation-safe. I may have to shield this function call from cancellation in order to obtain correctness of my program.

Currently, it is difficult to find documentation about how to write cancellation-safe code with python3. What tools are available for this goal? What primitives are cancellation-safe and what primitives are not? What are the precise promises that I can rely on when a cancellation-safe library function is cancelled? Maybe there is documentation about these questions and I just haven't found it.

While the method queue.get() advises to use cancellation to implement timeout --- it does not promise to be cancellation-safe. Knowing that this is a complex/messy question, I did not want to simply believe that it 'just works': Keeping such a promise requires a conscious effort of the programmers who implement and maintain the queue.get() API. I found that, indeed, earlier python implementations of this API were not cancellation-safe (see the bug reports quoted above). It seems to me now, that the implementation is intended to address this issue consistently. However, I did not find an explicit statement to that effect.

As of the above algorithm, the easiest solution is to limit the timeout scope just to one call: await task_queue.get() (provided that this call is cancellation-safe). Semantically, this is equivalent to add an optional timeout argument to the queue.get() API.

I know that implementing timeouts consistently is very complex and messy: Timeout simply do not compose cleanly. Adding an optional timeout argument to the queue.get() API solves my current problem but still leaves other needs unanswered.
I am fine with any solution that makes sense to you (and I would be happy to contribute to the discussion).

However, it's essential to document clearly what can be relied upon and what cannot. Without such documentation, the available features cannot be used confidently.

@gvanrossum
Copy link
Member

Hi @Yaakov-Belch,

Knowing what can be relied upon to be cancellation-safe is really hard, and I think it's usually better if users write code in a defensive way. Making promises about this is even harder, since there's no way to reliably test such assurances (there just are too many points where code might be cancelled, as you've demonstrated in your own 7-line code snippet).

Queue data types are especially hard to get right -- for example, the non-async Queue type has a SimpleQueue sibling that "provides additional guarantees in exchange for the smaller functionality" according to the docs, and even there the docs have to add qualifications like "except for potential low-level errors such as failure to allocate memory".

I do know one thing (hopefully you already figured this out). The .get() method on asyncio.Queue has no timeout parameter because we want you to use the generic timeout primitives.

All in all I am not inclined to add such guarantees to the documentation. If you're really worried about losing a queue item occasionally you should probably use a persistent queue implementation.

@gvanrossum gvanrossum reopened this May 24, 2022
@Yaakov-Belch
Copy link
Author

Hi @gvanrossum ,

I fully understand your reasoning -- the costs (developer attention and time) involved in promising more -- and keeping such promises. Would it be fair to at least put this into the documentation?

Due to possible race conditions, .get() method may occasionally lose queue items when used in conjunction with the generic timeout primitives. If you need guarantees, consider alternative implementations.

Such a statement (at the same place where the use of generic timeout primitives is recommended) would have saved me hours of work: Instead of searching for bug reports and hints what's implemented or not -- I would have gone straight to implementing a workaround for my concrete use case.

@gvanrossum
Copy link
Member

We would have to add that warning to every single API in asyncio. I don't think that's reasonable. Like all aspects of Python, it's open source and you use it at your own risk (read the license).

It is looking more and more like the only point of this issue is for you to have something or someone to blame for the two hours you wasted. If you knew how much time I wasted debugging trivial bugs in my own code you would understand that I don't think that's a lot of wasted time. So I am closing this issue.

@gvanrossum gvanrossum closed this as not planned Won't fix, can't repro, duplicate, stale May 24, 2022
@Yaakov-Belch
Copy link
Author

Thanks for the clarification. I was not actually suggesting to add a warning to every API call in asyncio. I would like to see that warning at the one place where the documentation suggests the usage of generic cancellation-based timeout primitives.

I very much respect your giant work creating python -- and the huge contribution of the community. If it appears that I intend to blame someone for something then this is certainly my mistake in expressing myself and not my actual intention.

@Yaakov-Belch
Copy link
Author

Let me clarify what my actual intention was -- and resolve this issue in a way that's good for me:

I am developing a library to be intended to be used by my clients in production. "Normal python bugs" are extremely rare and acceptable. But possible race conditions are not.

At some point I need an asynchronous queue.get() with a timeout. My question was: Shall I use the asyncio.Queue with the suggested call asyncio.wait_for(queue.get(), timeout) or shall I implement a separate queue that has a .get(timeout) API?

As explained above: This depends on whether async_io.Queue.get() is cancellation-safe or not. If it is cancellation-safe, the pythonic way is to use the standard library features. If it may not be cancellation-safe, I rather be defensive and implement my own queue.

My search revealed that:

  • Guaranteeing cancellation-safety is hard -- no library features make any promises here.
  • There have been bug reports of async_io.Queue.get() being not cancellation-safe.
  • It seems that the problem has been fixed -- and the current library API is most likely cancellation-safe. However, there is no documented intention to ensure that it stays so in the future.
  • Our conversation clarified that --understandably-- taking a public commitment to maintain the cancellation-safety of this API is not acceptable. If it is important that no queue items are lost, alternative solutions should be considered.

This answers my question: To be certain, I should implement a separate queue with the features and promises that I need.

Here is the starting point of a SimpleQueue implementation that has only the features I need --- without using exceptions or cancellations:

_default_future = asyncio.Future()
_default_future.set_result(None)

@dataclass
class SimpleQueue:
    queue:  list                     = field(default_factory=list, repr=False, init=False)
    future: Optional[asyncio.Future] = field(default=None, repr=False, init=False)

    def put_nowait(self, item):
        self.queue.append(item)
        if self.future is not None:
            old_future, self.future = self.future, asyncio.Future()
            old_future.set_result(None)

    async def get(self, timeout=None, default_future=_default_future):
        t0 = time.time()
        while len(self.queue) == 0:
            if timeout is None:
                dt = None
            else:
                dt = timeout - (time.time() - t0)
            if self.future is None:
                self.future = asyncio.Future()
            done, pending = await asyncio.wait([self.future], timeout=dt)
            if not done:
                return default_future.result()
        return self.queue.pop(0)

The call queue.get(timeout=timeout) returns None when the timeout expires. You can receive any other value (or raise an exception) in this case by providing a suitable resolved default_future argument.

This implementation promises:

  • Any items added with queue.put_nowait() are either returned by queue.get() or stay in the queue for future results.
  • queue.get(timeout) will return a valid item if one is available before the timeout expires.

Summary: Any programmer who finds himself with the same question that I had: "Shall I use the recommended features of the standard library or shall I write my own queue implementation?" can find this closed documentation request and can assess the costs and risks involved with both sides of the choice.

He can use my code as a starting point for his own SimpleQueue implementation (if this is desired). Two warnings, though:

  • Take the time to convince yourself that the logic of the code is actually providing what you want.
  • If you want to add features for a full-blown async Queue implementation --- expect this to become complex and difficult to reason about.

Even though this request is now closed as 'not planned' it achieves my goal of clearly documenting of whether cancellation-safety is a promised feature of the standard library.

I want to thank @gvanrossum and @kumaraditya303 for their time and feedback.

@ljw1004
Copy link

ljw1004 commented Mar 27, 2024

I do know one thing (hopefully you already figured this out). The .get() method on asyncio.Queue has no timeout parameter because we want you to use the generic timeout primitives.

@gvanrossum I don't think I understand this answer. Haven't we established that the generic timeout primitive asyncio.wait_for(q.get(), timeout) is unsafe because it may drop queue items? and therefore users should avoid the generic timeout primitive here?

I think it's usually better if users write code in a defensive way. Making promises about this is even harder, since there's no way to reliably test such assurances

If I understand you right, "defensive way" here means that users should defend themselves against the possibility that asyncio.wait_for(q.get(), timeout) will drop an item?

@gvanrossum
Copy link
Member

No, my understanding is that the last time this construct was reported to lose items was in 2015, and both issues look closed to me, so no, I don't believe it has been established that this construct can still lose items.

I invite you to look at the source code for asyncio/queues.py and asyncio/timeouts.py and demonstrate a race condition. (Note that the implementation of wait_for() currently just defers to the timeout context manager.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs Documentation in the Doc dir topic-asyncio
Projects
None yet
Development

No branches or pull requests

4 participants