
Replace queue with linked list #21


Draft · wants to merge 37 commits into base: main

Conversation

willmmiles

Replace the bounded queue with a linked list and "condvar" implementation, and replace the closed_slots system with double indirection via AsyncClient's own memory. This allows the system to correctly handle cases where it is not possible to allocate a new event while guaranteeing that the client's onDisconnect() will be run to free up any other related resources.

Key changes:

  • Replaces the fixed-size FreeRTOS queue with a dynamically allocated linked list
  • Removes the CONFIG_ASYNC_TCP_QUEUE_SIZE queue size limit
  • Makes AsyncClient's intrusive list support optional via CONFIG_ASYNCTCP_HAS_INTRUSIVE_LIST
  • Replaces the closed slot system with double indirection through AsyncClient's own _pcb member. Once initialized, this member is written only by the LwIP thread, preventing most races; and the AsyncClient object itself is never deleted on the LwIP thread.
  • Makes the internal callbacks private
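For illustration, here is a minimal sketch of the kind of linked-list-plus-"condvar" event queue described above; the type and member names and the binary-semaphore wakeup are assumptions for this example, not the PR's actual code.

  #include <freertos/FreeRTOS.h>
  #include <freertos/semphr.h>

  class AsyncClient;           // the real class from AsyncTCP.h

  struct event_t {             // hypothetical event node; the PR's lwip_tcp_event_packet_t differs
    event_t *next = nullptr;   // intrusive link used only by the queue
    AsyncClient *client = nullptr;
    // ... payload (pcb, pbuf, error code, ...) would live here
  };

  class EventQueue {
  public:
    // The caller allocates the event before calling push(), so enqueueing itself
    // can never fail; this is the property that guarantees onDisconnect() delivery.
    void push(event_t *e) {
      xSemaphoreTake(_mutex, portMAX_DELAY);
      e->next = nullptr;
      if (_tail) {
        _tail->next = e;
      } else {
        _head = e;
      }
      _tail = e;
      xSemaphoreGive(_mutex);
      xSemaphoreGive(_signal);               // "condvar" notify: wake the async task
    }

    // Blocks until an event is available, then unlinks and returns the head.
    event_t *pop_wait() {
      for (;;) {
        xSemaphoreTake(_mutex, portMAX_DELAY);
        event_t *e = _head;
        if (e) {
          _head = e->next;
          if (!_head) {
            _tail = nullptr;
          }
        }
        xSemaphoreGive(_mutex);
        if (e) {
          return e;
        }
        xSemaphoreTake(_signal, portMAX_DELAY);  // "condvar" wait for the next push()
      }
    }

  private:
    event_t *_head = nullptr;
    event_t *_tail = nullptr;
    SemaphoreHandle_t _mutex = xSemaphoreCreateMutex();
    SemaphoreHandle_t _signal = xSemaphoreCreateBinary();
  };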

This draft rebases the changes from willmmiles/AsyncTCP:master onto this development line. As this project is moving faster than I can keep up with, I have performed only minimal testing on this port, in the interest of making the code available for review sooner. It is likely there is a porting error somewhere.

Known issues as of this writing:

  • AsyncClient::operator=(const AsyncClient&) assignment operator is removed. The old code had implemented this operation with an unsafe partial move semantic, leaving the old object holding a dangling reference to the pcb. It's not clear to me what should be done here - copies of AsyncClient are not generally meaningful.
  • Poll coalescing is not yet implemented. This could be done with a list walk, as before, or via storing pending poll event pointers in the AsyncClient objects.
  • Move construction and assignment is not yet implemented. It will require careful interlocking with the LwIP and async threads.
  • I'm about 85% confident there's still a race somewhere when an AsyncClient is addressed from a third task (e.g. from an Arduino loop(), not LwIP or the async task). The fact that LwIP reserves the right to invalidate tcp_pcbs on its thread at any time after tcp_err makes this extremely challenging to get both strictly correct and performant. Core operations that pass to the LwIP thread are safe, but I think state reads (state(), space(), etc.) are still risky.
  • If strict memory bounds are required, a soft queue size limit could be reimplemented, with the caveat that each client's _end_event can ignore the limit to ensure onDisconnect gets run.

Future work:

  • lwip_tcp_event_packet_t::dns should be unbundled into a separate event object type from the rest of the lwip_tcp_event_packet_t variants. It's easily twice the size of any of the others; this will reduce the memory footprint for normal operation.
  • Event coalescing could be considered for other types -- recv and send also permit sensible aggregation.

Replace the bounded queue with a linked list and condvar
implementation, and replace the closed_slots system with double
indirection via AsyncClient's own memory.  This allows the
system to correctly handle cases where it is not possible to allocate a
new event and still ensure that the client's `onDisconnect()` will
be queued for execution.
@mathieucarbou requested a review from a team · February 5, 2025 14:32
@mathieucarbou
Member

Wow! Thanks a lot!
@vortigont and @me-no-dev will surely have a look!

@mathieucarbou
Member

I tested this implementation in ESPAsyncWebServer:

  • autocannon -c 16 -w 16 -d 20 http://192.168.4.1 => works, about same perf

  • autocannon -c 64 -w 32 -d 20 -t 30 http://192.168.4.1 => works fine, no errors, but some requests dropped, and I see a lot of:

[327202][E][AsyncTCP.cpp:1535] _accept(): TCP ACCEPT FAIL

  • sse perf: for i in {1..16}; do ( count=$(gtimeout 30 curl -s -N -H "Accept: text/event-stream" http://192.168.4.1/events 2>&1 | grep -c "^data:"); echo "Total: $count events, $(echo "$count / 4" | bc -l) events / second" ) & done;

slower (about 50-100 events per second less) but I did not see any queue overflow

  • slow requests (SlowChunkResponse.ino) work

  • request continuation works

  • and no memory leaks in any of these tests, while when I run those tests on main, the heap decreases a lot.

  • websocket also works

So to me, this is a huge improvement over what we have.

src/AsyncTCP.h Outdated
#ifndef CONFIG_ASYNC_TCP_MAX_ACK_TIME
#define CONFIG_ASYNC_TCP_MAX_ACK_TIME 5000
#endif

#ifndef CONFIG_ASYNCTCP_HAS_INTRUSIVE_LIST
#define CONFIG_ASYNCTCP_HAS_INTRUSIVE_LIST 1
Member

what is the use of that?

Author

First porting error - both the name and the implementation are bad. :(

There's a strange feature in the original code: AsyncClient has an intrusive list integrated (the prev/next pointers), but it's unclear why it exists or what it's useful for. I removed it in my branch to save RAM, as nobody in WLED (AsyncWebServer, AsyncMQTT) actually uses it for anything. I was trying to make it conditional, but default on. Will fix.

Author

(default on for backwards compatibility, in case there is someone using it)

Member

Well, I don't see any reason to keep this linked list... Do you, @me-no-dev? This is not even used.


@mathieucarbou
Member

@willmmiles : fyi as seen with @me-no-dev we will merge #19 first, do a v3.3.4, then focus on reviewing / merging your PR.
And we will release a new 4.0.0 once merged.
So you can remove the operator overrides, and the prev/next fields, which are unused internally anyway.
Thanks 👍

@willmmiles
Author

@willmmiles : fyi as seen with @me-no-dev we will merge #19 first, do a v3.3.4, then focus on reviewing / merging your PR. And we will release a new 4.0.0 once merged. So you can remove the operator overrides, and the prev/next fields, which are unused internally anyway. Thanks 👍

Sounds good, will do!

@vortigont

I've only had a quick look here; an interesting approach indeed. But I also have a lot of questions to help me understand it.
First, I like the idea of giving up the slot system, but replacing the RTOS queue with a poor man's linked list brings all the burden of its integrity into the lib's code: sync/locking, length management, etc...

First things first: how is the list size controlled here? Does it have any limits, or can it grow as long as resources allow?

@willmmiles
Author

First, I like the idea of giving up the slot system, but replacing the RTOS queue with a poor man's linked list brings all the burden of its integrity into the lib's code: sync/locking, length management, etc...
First things first: how is the list size controlled here? Does it have any limits, or can it grow as long as resources allow?

This draft makes no attempt to limit the queue length, so long as there's heap available. A strictly fixed size queue is impractical because we must ensure that disconnection events cannot be dropped, or resources will leak. It's possible to add a soft upper limit on non-critical events, but it didn't seem to be worth the extra complexity (or having to explain an arbitrary resource limit that's independent of what the heap can actually service). Implementing event coalescence for poll, recv, and sent events will put a functional upper bound on the queue size as well, based on the number of open connections.

The rationale to replace the queue breaks down as:

  • (Correctness) There exist events that must never be dropped, so maintaining a strict upper limit requires an extremely complicated dance to invalidate accepted events (and abort yet more connections!) if the queue should fill up
  • (Correctness) The "purge events on close" sequence requires an external lock on the queue, or there's a potential risk that events will be re-inserted out of order if new events arrive during the operation. Using the RTOS queue means paying the locking penalty twice (since it has an internal lock).
  • (Correctness) Available list libraries in std will crash if allocation fails.
  • (Performance) The "purge events on close" requires locking the queue only once, instead of 2*N times
  • (Performance) The local queue implementation minimizes the amount of code space required -- it is significantly smaller than std::forward_list
  • (Simplicity) The "purge events on close" becomes trivial code.
  • (Simplicity) Once an event has been allocated, it is guaranteed that it can be added to the queue; this eliminates all the code paths that have to free the new event should the queue turn out to be full.

I'm usually the first to recommend using library implementations for classic data structures; ultimately I judged that the maintenance burden for the limited requirements here was less than the cost of bringing in some external library, given that std is unusable for my use case (specifically: not crashing when low on memory). If you know of a suitable alternative library (especially one that allows pre-allocation of items before locking the queue) and you're happy with bringing in a new dependency, I've otherwise got no objections.
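To make the "purge events on close" point above concrete, here is a sketch of the single-pass removal, using the illustrative event_t from the earlier sketch rather than the PR's actual types; the caller is assumed to hold the queue lock, so the whole purge costs exactly one lock acquisition.

  // Unlink every event belonging to 'client' in one pass, preserving the relative
  // order of all other events.  The removed events are returned as a chain so the
  // caller can free them after releasing the lock.
  static event_t *unlink_client_events(event_t *&head, event_t *&tail, AsyncClient *client) {
    event_t *removed = nullptr;
    event_t **cursor = &head;
    tail = nullptr;
    while (*cursor) {
      event_t *e = *cursor;
      if (e->client == client) {
        *cursor = e->next;    // unlink in place
        e->next = removed;
        removed = e;
      } else {
        tail = e;             // the last surviving node becomes the new tail
        cursor = &e->next;
      }
    }
    return removed;
  }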

Replacing the close_slot system is otherwise straightforward (it's nothing more than strict ownership of the pcb* by the LwIP thread), but it is contingent on guaranteed disconnect event delivery. I did originally implement these as separate changes, but since it wasn't going to merge cleanly with the development line from my old fork, I judged it wasn't worth trying to break it down into separate commits.

@mathieucarbou
Member

mathieucarbou commented Feb 6, 2025

@willmmiles : fyi asynctcp is released and we'll do an asyncwebserver release tomorrow with the current v3.3.5 of asynctcp.

So we are good to refresh this PR and take the time to review / merge it.

When I tested with autocannon: autocannon -c 64 -w 32 -d 20 -t 30 http://192.168.4.1, or different variants at 16 connections or more, I always found this implementation more stable, compared to the current one in main, which has these issues:

  • memory leaks, and a lot! Impossible to run the same test twice
  • and some error logs regarding inability to allocate, ack timeouts, etc.

I didn't test the client part yet. Will do later.

No reason for this to require a function call.
Use new(std::nothrow) instead.
If any of the TCP callbacks fails to allocate, return ERR_MEM.
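As a rough illustration of the pattern these commits describe - allocate with new(std::nothrow) and report ERR_MEM to lwIP when that fails - here is a hedged sketch. Only the signature matches lwIP's tcp_recv_fn; the body, event_t, and queue are the placeholder names from the sketches above, not AsyncTCP's actual code.

  #include <new>
  #include "lwip/err.h"
  #include "lwip/tcp.h"

  static err_t _tcp_recv(void *arg, struct tcp_pcb *pcb, struct pbuf *pb, err_t err) {
    event_t *e = new (std::nothrow) event_t{};    // no exception, no abort on OOM
    if (!e) {
      // Do not free pb here: returning an error (not ERR_OK/ERR_ABRT) tells lwIP to
      // keep the data as refused data and deliver it again later, so nothing is lost.
      return ERR_MEM;
    }
    // record arg/pcb/pb/err in the event, then hand it to the async task
    queue.push(e);
    return ERR_OK;
  }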
src/AsyncTCP.cpp Outdated
tcp_recv(pcb, &_tcp_recv);
tcp_sent(pcb, &_tcp_sent);
tcp_err(pcb, &_tcp_error);
tcp_poll(pcb, &_tcp_poll, 1);
Member

maybe use a constant here? It was CONFIG_ASYNC_TCP_POLL_TIMER before

Author

willmmiles commented Feb 6, 2025

#2 on the merge failure list!

Author

Fixed in 2321755

Comment on lines +789 to +790
tcpip_api_call(_tcp_connect_api, (struct tcpip_api_call_data *)&msg);
return msg.err == ESP_OK;
Member

there is an issue with the client part, which is not working (tested with Client.ino).

connect() returns false from here.

I've changed the code to get the error:

  tcpip_api_call(_tcp_connect_api, (struct tcpip_api_call_data *)&msg);
  if(msg.err != ERR_OK) {
    log_e("tcpip_api_call error: %d", msg.err);
  }
  return msg.err == ESP_OK;
[  1305][E][AsyncTCP.cpp:791] _connect(): tcpip_api_call error: -16

-16 is the invalid arg error.

Hope that helps!

Author

And there's merge failure #3 - I missed an underscore (and put the lines in the wrong place) when importing the locking code.

Fixed in 26e49f6

Should be using the config setting.
We can store the newly allocated handle directly in the object member.
We don't plan on writing to it - might as well save the copying.
@willmmiles
Author

I've put a prototype event coalescing branch at https://github.com/willmmiles/AsyncTCP/tree/replace-queue-coalesce . My basic tests seem to work, but I don't yet have a good test case to really exercise the new logic.

@vortigont

@willmmiles agreed with most of your points above; with proper coalescing code the event chain size cap won't be that critical. And the benefits of correctness are worth the effort on proper locking code. Good job indeed!

I'm not sure I understood your point on why std::forward_list is not usable here; a list node could be created with proper alloc exception handling, then the queue locked and the node moved onto the list chain. I do not see much of a problem here, but on second thought it might be a good choice to use simplified list-struct code here. std containers on esp32 are linked against a quite large lib that can't be stripped down and affects fw size considerably. On one hand we already use those containers in AsyncWebServer code, so it will bring in the cxx lib anyway, but on the other hand, if we consider AsyncTCP as a standalone lib that might be used elsewhere, it would be wise to minimize the dependencies that considerably affect the resulting fw size. Point taken!

@me-no-dev
Member

Hi folks! Just trying to understand the issues that this PR is hoping to fix. Is it just the fact that events might be missed because the queue overflows, or are there other things being fixed as well? Please excuse my ignorance, I've been out of the game for a bit :)

@mathieucarbou
Member

mathieucarbou commented Feb 7, 2025

Hi folks! Just trying to understand the issues that this PR is hoping to fix. Is it just the fact that events might be missed because the queue overflows, or are there other things being fixed as well? Please excuse my ignorance, I've been out of the game for a bit :)

Here are the problematic behaviours that we currently have and that this implementation seems to solve:

  • Memory leak: currently asynctcp has one or many memory leaks, which can easily be seen by running the Perf example with autocannon at around 32 concurrent connections. A large amount of heap is lost, and the test cannot be run twice: the MCU crashes because it cannot allocate.
  • Running several concurrent connections with asynctcp also produces occasional ack timeouts and memory errors linked to the above.
  • SSE events lost: currently, when running 16 concurrent SSE connections, the queue fills up quickly and discards some events, while I do not see events discarded with this implementation. The send rate with this implementation is a little higher, so that might be the reason.

Plus all the valid points Will explained above. I think it makes sense to include his changes, following the many efforts he has made in stabilising this library in the case of WLED.

@willmmiles : I will retest your changes and the replace-queue-coalesce branch also, we have 2 use cases.

This simplifies the queue operations.
@willmmiles
Author

I've found the race: if the async thread is in the middle of processing an event that results in destruction of the AsyncClient (fin, or a sent that completes a transaction) when a _tcp_error comes in, the _end_event gets freed before dequeuing and bad things happen. I have a fix in mind, but it'll take another day or two to get enough hours to fully validate it.

For clarity of implementation
Eliminates any possibility that a client may have its end event queued
twice.  LwIP should prevent this anyways, but it never hurts to be safe.
src/AsyncTCP.cpp Outdated
// The associated pcb is now invalid and will soon be deallocated
// We call on the preallocated end event from the client object
lwip_tcp_event_packet_t *e = AsyncClient_detail::invalidate_pcb(*client);
assert(e);
Member

could e be null here if client._end_event is null following a bad allocation?

Author

_tcp_error isn't bound to the pcb unless the end_event could be allocated.
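For context, a sketch of that invariant as it could appear inside AsyncTCP.cpp; the function name is hypothetical and the real code path differs, but the idea is that the end/disconnect event is allocated first and tcp_err is only attached once that allocation has succeeded.

  static bool attach_pcb_callbacks(AsyncClient *client, tcp_pcb *pcb) {
    // Preallocate the end event so the error path never needs to allocate.
    client->_end_event = new (std::nothrow) lwip_tcp_event_packet_t{};
    if (client->_end_event == nullptr) {
      return false;                // caller aborts the connection instead
    }
    tcp_arg(pcb, client);
    tcp_err(pcb, &_tcp_error);     // bound only after _end_event is guaranteed to exist
    tcp_recv(pcb, &_tcp_recv);
    tcp_sent(pcb, &_tcp_sent);
    tcp_poll(pcb, &_tcp_poll, 1);
    return true;
  }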

Ensure that the discard event is run once and exactly once for each
client.
It's unclear why this might be necessary - LwIP oughtn't be triggering
new callbacks after the error callback; but just to be on the safe
side, make sure we have no dangling references in the pcb we are
releasing.

Related to the changes in me-no-dev#31.
The existing TCP_MUTEX_(UN)LOCK macros are not reentrant nor safe
to call in code that might get run on the LwIP thread. Convert to a C++
lock_guard-style object, which is safer, cleans up some code paths,
and correctly supports AsyncClient's constructor.
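A minimal sketch of what such a lock_guard-style object can look like, assuming ESP-IDF's lwIP port (LOCK_TCPIP_CORE()/UNLOCK_TCPIP_CORE() plus the esp-lwip sys_thread_tcpip() holder query); the class in the PR differs in name and detail.

  #include "lwip/sys.h"
  #include "lwip/tcpip.h"

  // RAII guard for the lwIP core lock.  Unlike a bare LOCK_TCPIP_CORE() /
  // UNLOCK_TCPIP_CORE() pair, it skips locking when the caller already holds the
  // core lock (e.g. code running on the LwIP thread), so it is safe in code paths
  // that may run on either thread.
  class tcp_core_guard {
  public:
    tcp_core_guard() : _locked(!sys_thread_tcpip(LWIP_CORE_LOCK_QUERY_HOLDER)) {
      if (_locked) {
        LOCK_TCPIP_CORE();
      }
    }
    ~tcp_core_guard() {
      if (_locked) {
        UNLOCK_TCPIP_CORE();
      }
    }
    tcp_core_guard(const tcp_core_guard &) = delete;
    tcp_core_guard &operator=(const tcp_core_guard &) = delete;
  private:
    bool _locked;
  };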
This gives access to AsyncClient internals, preparing for O(1)
coalescing.
This is preparing to expand the guard scope to include other client
object members for event coalescence.
If one of these events is pending on a client, subsequent callbacks can
be safely rolled together into one event.
It's redundant in this design; we block forever
Add a feature to use the LwIP lock for the queue lock.  This reduces the
load on the LwIP thread, but might have the async thread wait longer.
Saves a little space when CONFIG_ASYNC_TCP_QUEUE_LWIP_LOCK is set
@willmmiles
Copy link
Author

This seems to be properly stable for me, now - the race is resolved and I've put in extra checking with the discard callback to catch any as yet unconsidered cases for double-free. I've also merged an equivalent to #31, to be on the safe side; as I understood the docs, LwIP shouldn't trigger any more callbacks after tcp_err, but it doesn't hurt (much) to do the extra cleanup to ensure our own invariants are met.

I pulled the queue code out to a separate file to make it easier to inspect. I actually tried a version that used std::unique_ptr<> to validate the event pointer management; it functioned, though at a code size penalty and performance regression. :( I don't love that the header is exposed to users; it's easy enough to move the public header to "include" for platformio, but I don't currently have test environments for Arduino or ESP-IDF package management to ensure they'll also pick it up. Would it be better to embed it back in the cpp file?

The rebased coalescing support is on the replace-queue-coalesce branch. I tried three different locking approaches: a "double lock" approach where it locks to check for coalescing, releases it to malloc, and locks again to enqueue; a "lock over whole handler" approach (tip default); and a "just use the LwIP lock" (CONFIG_ASYNC_TCP_QUEUE_LWIP_LOCK) option. My test case results suggested that all of these came with some performance penalty, and the larger lock scopes seemed worse - somewhat surprising, as I would've thought the semaphore acquire/release would dominate. I have not yet tried using atomics instead of locking for poll and sent; and it may be that recv coalescence isn't worth bothering with (though there's a simplification to be had there in eliminating the need for _free_event).

Is there anything else I should look at?

@mathieucarbou
Copy link
Member

mathieucarbou commented Mar 1, 2025

@willmmiles thank you very much!
Yes, if you could please rebase on main, then I will retest the PR carefully, and @me-no-dev and @vortigont will need to take time to review it.

Instead of pointing to the chain, point to the event.  Fixes handling
of cases with nonzero error events mid-sequence.

This requires clarifying the _remove_events/invalidate_pcb.
@willmmiles
Author

Unfortunately it wasn't really practical to rebase - you've adopted some of the fixes directly, and merging via rebase was just a nightmare of conflicts. This branch does too much all at once to keep up with the ongoing development and maintenance in the main line, given my time limitations. I did some manual merges in my development line with coalescing support, which were painful enough that I'm not sure they're worth repeating.

For the sake of discussion, I've pushed the full coalescence development line here in this draft. The line now has several different approaches to coalescing:

  • "double locking" -- taking the queue lock, coalescing if possible; otherwise releasing the queue lock and going through the existing logic. Advantage is the queue lock is held for the smallest amount of time; disadvantage is needing to lock twice on the new event path.
  • "single locking" -- holding the queue lock over the entire handler scope, resulting in fewer lock operations. Load testing suggests this is a performance loss compared to the double locking solution.
  • "single locking using LwIP lock" -- Dead end. Only really viable on newer platforms. Slower still than explicit single locking.
  • Atomics, viable for poll and sent but not recv. No noticeable performance change from the "single locking" solution; costs about 250 bytes of code compared to the "single locking" system.

Personally I think either double locking or atomics are the way to go. The value of recv coalescence is IMO somewhat debatable; it might save queuing some events, but adds a risk of unfairness/starvation as the event handlers might end up processing more data at once.
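For reference, a rough sketch of the "double locking" fast path for poll events; the names here (queue_mutex, _pending_poll, poll_count) are placeholders of mine, and only the signature matches lwIP's tcp_poll_fn.

  static err_t _tcp_poll(void *arg, struct tcp_pcb *pcb) {
    AsyncClient *client = static_cast<AsyncClient *>(arg);

    // First lock: only to check whether a poll event for this client is already queued.
    xSemaphoreTake(queue_mutex, portMAX_DELAY);
    if (client->_pending_poll != nullptr) {   // hypothetical back-pointer into the queue
      client->_pending_poll->poll_count++;    // coalesce: no allocation, no new node
      xSemaphoreGive(queue_mutex);
      return ERR_OK;
    }
    xSemaphoreGive(queue_mutex);

    // Slow path: allocate outside the lock, then lock a second time inside push().
    event_t *e = new (std::nothrow) event_t{};
    if (e == nullptr) {
      return ERR_MEM;
    }
    e->client = client;
    queue.push(e);   // push() would also record e as the client's pending poll event
    return ERR_OK;
  }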

If there's no hurry, what I think might be the best way to go is to do a "logical rebase" from main, one key change at a time, to make it easier to review each step in isolation. I'll leave this PR open if there's any value in testing the end state as-is. The basic change sequence I suggest is:

  • Removal of intrusive list
  • Shallow queue replacement with minimal/no logic changes
  • Close slots removal (fixes possible race with re-used slot)
  • end_event preallocation (fixes possible case of unqueuable error event resulting in object leak)
  • Any other safety-related changes and/or validation
  • Additional coalescence, if useful.

Does this make sense?

@vortigont

@willmmiles sounds reasonable; indeed the differences with the main branch are quite complex, and I'm not sure what would be more challenging: merging your changes into main or porting upstream changes back to your branch.
As for coalescing - a large field for discussion :) let's take it piece by piece. I'm here to help with merging if you can split/delegate tasks.
