Replace queue with linked list #21
base: main
Conversation
Replace the bounded queue with a linked list and condvar implementation, and replace the closed_slots system with double indirection via AsyncClient's own memory. This allows the system to correctly handle cases where it is not possible to allocate a new event and still ensure that the client's `onDisconnect()` will be queued for execution.
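For orientation, here is a minimal sketch of the kind of structure described above - an intrusive singly-linked list guarded by a mutex and condition variable. This is not the PR's actual code: the names are placeholders, and the real implementation presumably uses FreeRTOS primitives rather than the standard library.

```cpp
#include <condition_variable>
#include <mutex>

struct Event {            // stand-in for the library's event packet type
  Event* next = nullptr;  // intrusive link, so push() never needs to allocate
  // ... event payload ...
};

class EventQueue {
  Event* head_ = nullptr;
  Event** tail_ = &head_;  // address of the last 'next' slot
  std::mutex mtx_;
  std::condition_variable cv_;

 public:
  // Producer side (LwIP thread). Never fails: the node is supplied by the
  // caller, so even a preallocated "end" event can always be queued.
  void push(Event* ev) {
    {
      std::lock_guard<std::mutex> lock(mtx_);
      ev->next = nullptr;
      *tail_ = ev;
      tail_ = &ev->next;
    }
    cv_.notify_one();
  }

  // Consumer side (async task). Blocks until an event is available.
  Event* pop() {
    std::unique_lock<std::mutex> lock(mtx_);
    cv_.wait(lock, [this] { return head_ != nullptr; });
    Event* ev = head_;
    head_ = ev->next;
    if (head_ == nullptr) {
      tail_ = &head_;
    }
    return ev;
  }
};
```

Because push() takes a caller-supplied node and never allocates, a client's preallocated disconnect event can always be enqueued even under memory pressure, which is the property this design relies on.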
Wow! Thanks a lot!
src/AsyncTCP.h
Outdated
#ifndef CONFIG_ASYNC_TCP_MAX_ACK_TIME
#define CONFIG_ASYNC_TCP_MAX_ACK_TIME 5000
#endif

#ifndef CONFIG_ASYNCTCP_HAS_INTRUSIVE_LIST
#define CONFIG_ASYNCTCP_HAS_INTRUSIVE_LIST 1
What is the use of that?
First porting error - both the name and the implementation are bad. :(
There's a strange feature in the original code: AsyncClient has an intrusive list integrated (the prev/next pointers), but it's unclear why or what it's useful for. I removed it in my branch to save RAM, as nobody in WLED (AsyncWebServer, AsyncMQTT) actually uses it for anything. Here I was trying to make it conditional, but default on. Will fix.
(default on for backwards compatibility, in case there is someone using it)
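For reference, a rough sketch of the sort of conditional compilation being discussed - the macro name comes from the diff above, while the member names and placement are assumptions:

```cpp
class AsyncClient {
  // ... existing members ...
#if CONFIG_ASYNCTCP_HAS_INTRUSIVE_LIST
 public:
  AsyncClient* prev = nullptr;  // legacy intrusive list links, kept only
  AsyncClient* next = nullptr;  // when the option is enabled (default 1)
#endif
};
```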
Well, I don't see any reason to keep this linked list... Do you, @me-no-dev? This is not even used.
@willmmiles : fyi as seen with @me-no-dev we will merge #19 first, do a v3.3.4, then focus on reviewing / merging your PR.
Sounds good, will do!
The semantics of this operation were error-prone; remove it to ensure safety.
I had just a short glimpse here, an interesting approach indeed. But I also have a lot of questions to help me understand. First things first - how is the list size controlled here? Does it have any limits, or can it grow as far as resources allow?
This draft makes no attempt to limit the queue length, so long as there's heap available. A strictly fixed size queue is impractical because we must ensure that disconnection events cannot be dropped, or resources will leak. It's possible to add a soft upper limit on non-critical events, but it didn't seem to be worth the extra complexity (or having to explain an arbitrary resource limit that's independent of what the heap can actually service). Implementing event coalescence for poll, recv, and sent events will put a functional upper bound on the queue size as well, based on the number of open connections. The rationale to replace the queue breaks down as:
I'm usually the first to recommend using library implementations for classic data structures; ultimately I judged that the maintenance burden for the limited requirements here was less than the cost of bringing in some external library. Replacing the close_slot system is otherwise straightforward (it's nothing more than strict ownership of the pcb* by the LwIP thread), but it is contingent on guaranteed disconnect event delivery. I did originally implement these as separate changes, but since it wasn't going to merge cleanly with the development line from my old fork, I judged it wasn't worth trying to break it down into separate commits.
@willmmiles : fyi asynctcp is released and we'll do an asyncwebserver release tomorrow with the current v3.3.5 of asynctcp. So we are good to refresh this PR and have time to review / merge it. When I tested with autocannon:
I didn't test the client part yet. Will do later.
No reason for this to require a function call.
Use new(std::nothrow) instead.
If any of the TCP callbacks fails to allocate, return ERR_MEM.
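A hedged illustration of the pattern these commit messages describe - the event type and the enqueue helper are placeholders, but the callback signature matches LwIP's tcp_recv_fn, and ERR_MEM is the standard way to tell LwIP the data could not be accepted yet:

```cpp
#include <new>
#include "lwip/err.h"
#include "lwip/tcp.h"

struct Event {                // stand-in for the library's event packet type
  struct pbuf* pb = nullptr;
};

static void enqueue(Event* e);  // hand-off to the async task (not shown)

static err_t on_recv(void* arg, struct tcp_pcb* pcb, struct pbuf* pb, err_t err) {
  (void)arg; (void)pcb; (void)err;  // FIN/error handling omitted for brevity
  Event* e = new (std::nothrow) Event;
  if (e == nullptr) {
    // Returning ERR_MEM leaves the pbuf with LwIP, which will re-deliver it
    // later instead of the data being silently dropped.
    return ERR_MEM;
  }
  e->pb = pb;
  enqueue(e);
  return ERR_OK;
}
```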
src/AsyncTCP.cpp
Outdated
tcp_recv(pcb, &_tcp_recv);
tcp_sent(pcb, &_tcp_sent);
tcp_err(pcb, &_tcp_error);
tcp_poll(pcb, &_tcp_poll, 1);
Maybe use a constant here? It was CONFIG_ASYNC_TCP_POLL_TIMER before.
#2 on the merge failure list!
Fixed in 2321755
tcpip_api_call(_tcp_connect_api, (struct tcpip_api_call_data *)&msg);
return msg.err == ESP_OK;
There is an issue with the client part, which is not working (testing with the Client.ino). connect() returns false from here.
I've changed the code to get the error:
tcpip_api_call(_tcp_connect_api, (struct tcpip_api_call_data *)&msg);
if(msg.err != ERR_OK) {
log_e("tcpip_api_call error: %d", msg.err);
}
return msg.err == ESP_OK;
[ 1305][E][AsyncTCP.cpp:791] _connect(): tcpip_api_call error: -16
-16 is the invalid arg error.
Hope that helps!
Should be using the config setting.
We can store the newly allocated handle directly in the object member.
We don't plan on writing to it - might as well save the copying.
I've put a prototype event coalescing branch at https://github.com/willmmiles/AsyncTCP/tree/replace-queue-coalesce. My basic tests seem to work, but I don't yet have a good test case to really exercise the new logic.
@willmmiles agreed to most of your points above; with proper coalescing code the event chain size cap won't be that critical. And the benefits of correctness are worth the effort on proper locking code. Good job indeed! I'm not sure I understood your point on why
Hi folks! Just trying to understand the issues that this PR is hoping to fix. Is it just the fact that events might be missed because the queue overflows, or are there other things being fixed as well? Please excuse my ignorance. I've been out of the game for a bit :)
Here are the problematic behaviours that we currently have and that this implementation seems to solve:
Plus all the valid points Will explained above. I think it makes sense to include his changes, given the many efforts he has made stabilising this library in the case of WLED. @willmmiles : I will retest your changes and the replace-queue-coalesce branch also; we have 2 use cases.
This simplifies the queue operations.
I've found the race: if the async thread is in the middle of processing an event that results in destruction of the AsyncClient (fin, or a sent that completes a transaction) when an
For clarity of implementation
Eliminates any possibility that a client may have its end event queued twice. LwIP should prevent this anyway, but it never hurts to be safe.
src/AsyncTCP.cpp
Outdated
// The associated pcb is now invalid and will soon be deallocated
// We call on the preallocated end event from the client object
lwip_tcp_event_packet_t *e = AsyncClient_detail::invalidate_pcb(*client);
assert(e);
Could e be null here if client._end_event is null following a bad allocation?
`_tcp_error` isn't bound to the pcb unless the `end_event` could be allocated.
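A sketch of the ordering guarantee being described - the names here are hypothetical, but the point is that tcp_err() is only registered once the preallocated end event exists, so the error path never has to allocate:

```cpp
#include <new>
#include "lwip/tcp.h"

struct EndEvent { /* disconnect details */ };

struct ClientState {
  EndEvent* end_event = nullptr;
};

static void on_tcp_error(void* arg, err_t err) {
  (void)err;
  auto* c = static_cast<ClientState*>(arg);
  // Hand c->end_event to the async queue here; no allocation is required.
  (void)c;
}

static bool attach(ClientState& c, struct tcp_pcb* pcb) {
  c.end_event = new (std::nothrow) EndEvent;
  if (c.end_event == nullptr) {
    return false;               // caller closes the pcb instead of proceeding
  }
  tcp_arg(pcb, &c);
  tcp_err(pcb, &on_tcp_error);  // bound only after end_event is guaranteed
  return true;
}
```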
Ensure that the discard event is run exactly once for each client.
It's unclear why this might be necessary - LwIP oughtn't be triggering new callbacks after the error callback; but just to be on the safe side, make sure we have no dangling references in the pcb we are releasing. Related to the changes in me-no-dev#31.
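A small sketch of that defensive cleanup (helper name assumed): every callback and the arg pointer are cleared before the pcb is handed back, so LwIP has no path back into freed memory:

```cpp
#include "lwip/tcp.h"

static void detach_pcb_callbacks(struct tcp_pcb* pcb) {
  if (pcb == nullptr) {
    return;
  }
  tcp_arg(pcb, nullptr);
  tcp_recv(pcb, nullptr);
  tcp_sent(pcb, nullptr);
  tcp_err(pcb, nullptr);
  tcp_poll(pcb, nullptr, 0);
}
```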
The existing TCP_MUTEX_(UN)LOCK macros are not reentrant nor safe to call from code that might run on the LwIP thread. Convert to a C++ lock_guard-style object, which is safer, cleans up some code paths, and correctly supports AsyncClient's constructor.
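A lock_guard-style sketch of the idea (not the PR's actual class): acquire the LwIP core lock on construction and release it on destruction, skipping the lock when already running on the LwIP thread. It assumes LWIP_TCPIP_CORE_LOCKING is enabled, and in_lwip_thread() is a placeholder for however the library detects that case:

```cpp
#include "lwip/tcpip.h"

extern bool in_lwip_thread();  // placeholder; implementation-specific check

class TcpCoreGuard {
  bool locked_ = false;

 public:
  TcpCoreGuard() {
    if (!in_lwip_thread()) {
      LOCK_TCPIP_CORE();
      locked_ = true;
    }
  }
  ~TcpCoreGuard() {
    if (locked_) {
      UNLOCK_TCPIP_CORE();
    }
  }
  TcpCoreGuard(const TcpCoreGuard&) = delete;
  TcpCoreGuard& operator=(const TcpCoreGuard&) = delete;
};
```

Being scope-based, the guard cannot be left locked on an early return, and the thread check makes it usable from constructors or code paths that may already be running under the LwIP lock.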
This gives access to AsyncClient internals, preparing for O(1) coalescing.
This is preparing to expand the guard scope to include other client object members for event coalescence.
If one of these events is pending on a client, subsequent callbacks can be safely rolled together into one event.
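A hypothetical sketch of what coalescing a recv callback might look like (field and lock names assumed): if a recv event for the client is already sitting in the queue, the new pbuf is chained onto it under the queue lock instead of allocating a second event:

```cpp
#include <mutex>
#include "lwip/pbuf.h"

struct RecvEvent {
  struct pbuf* pb = nullptr;
};

struct Client {
  RecvEvent* pending_recv = nullptr;  // non-null while a recv event is queued
};

extern std::mutex queue_mutex;        // guards the queue and pending_recv

// Returns true if the pbuf was folded into an already-queued event;
// otherwise the caller allocates and queues a new event as usual.
static bool try_coalesce_recv(Client& c, struct pbuf* pb) {
  std::lock_guard<std::mutex> lock(queue_mutex);
  if (c.pending_recv == nullptr) {
    return false;
  }
  pbuf_cat(c.pending_recv->pb, pb);  // chain the new data onto the pending event
  return true;
}
```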
It's redundant in this design; we block forever
Add a feature to use the LwIP lock for the queue lock. This reduces the load on the LwIP thread, but might have the async thread wait longer.
Saves a little space when CONFIG_ASYNC_TCP_QUEUE_LWIP_LOCK is set
This seems to be properly stable for me now - the race is resolved and I've put in extra checking with the discard callback to catch any as-yet-unconsidered cases of double-free. I've also merged an equivalent to #31, to be on the safe side; as I understood the docs, LwIP shouldn't trigger any more callbacks after the error callback. I pulled the queue code out to a separate file to make it easier to inspect. Coalesce support is rebased on replace-queue-coalesce. I tried three different locking approaches: a "double lock" approach where it locks to check for coalescing, releases the lock to malloc, and locks again to enqueue; a "lock over whole handler" approach (tip default); and a "just use the LwIP lock" approach (CONFIG_ASYNC_TCP_QUEUE_LWIP_LOCK). Is there anything else I should look at?
@willmmiles thank you very much! |
Instead of pointing to the chain, point to the event. Fixes handling of cases with nonzero error events mid-sequence. This requires clarifying the _remove_events/invalidate_pcb.
Unfortunately it wasn't really practical to rebase - you've adopted some of the fixes directly, and merging via rebase was just a nightmare of conflicts. This branch does too much all at once to keep up with the ongoing development and maintenance in the main line, given my time limitations. I did some manual merges in my development line with coalescing support, which were painful enough that I'm not sure they're worth repeating. For the sake of discussion, I've pushed the full coalescence development line here in this draft. The line now has several different approaches to coalescing:
Personally I think either double locking or atomics are the way to go. If there's no hurry, what I think might be the best way to go is to do a "logical rebase" from main, one key change at a time, to make it easier to review each step in isolation. I'll leave this PR open if there's any value in testing the end state as-is. The basic change sequence I suggest is:
Does this make sense?
@willmmiles sounds reasonable, indeed the differences with the main branch are quite complex and I'm not sure what would be more challenging - merging your changes into main or porting upstream changes back to your branch.
Replace the bounded queue with a linked list and "condvar" implementation, and replace the closed_slots system with double indirection via AsyncClient's own memory. This allows the system to correctly handle cases where it is not possible to allocate a new event while guaranteeing that the client's `onDisconnect()` will be run to free up any other related resources.

Key changes:
- Removes the `CONFIG_ASYNC_TCP_QUEUE_SIZE` queue size limit; the queue can grow as long as heap is available.
- Adds `CONFIG_ASYNCTCP_HAS_INTRUSIVE_LIST` to make the intrusive client list conditional (default on).
- Replaces the closed_slots system with `AsyncClient`'s own `_pcb` member. Once initialized, this member is written only by the LwIP thread, preventing most races; and the `AsyncClient` object itself is never deleted on the LwIP thread.

This draft rebases the changes from willmmiles/AsyncTCP:master to this development line. As this project is moving faster than I can keep up with, in the interests of making this code available for review sooner, I have performed only minimal testing on this port. It is likely there is a porting error somewhere.

Known issues as of this writing:
- The `AsyncClient::operator=(const AsyncClient&)` assignment operator is removed. The old code had implemented this operation with an unsafe partial move semantic, leaving the old object holding a dangling reference to the pcb. It's not clear to me what should be done here - copies of AsyncClient are not generally meaningful.
- Thread safety is not fully resolved when an `AsyncClient` is addressed from a third task (eg. from an Arduino `loop()`, not LwIP or the async task). The fact that LwIP reserves the right to invalidate tcp_pcbs on its thread at any time after `tcp_err` makes this extremely challenging to get both strictly correct and performant. Core operations that pass to the LwIP thread are safe, but I think state reads (`state()`, `space()`, etc.) are still risky.
- There is no queue length limit; if one is reintroduced, `_end_event` can ignore the limit to ensure `onDisconnect` gets run.

Future work:
- `lwip_tcp_event_packet_t::dns` should be unbundled to a separate event object type from the rest of the `lwip_tcp_event_packet_t` variants. It's easily twice the size of any of the others; this will reduce the memory footprint for normal operation.
- `recv` and `send` events also permit sensible aggregation.
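As a closing illustration of that last "unbundling" point, one possible shape (hypothetical types, not a design commitment) is to keep the rarely used, larger DNS fields in a derived event so the common events stay small:

```cpp
#include "lwip/ip_addr.h"

enum class EventKind { Recv, Sent, Poll, Error, Dns /* ... */ };

struct BaseEvent {
  BaseEvent* next = nullptr;  // intrusive queue link
  EventKind kind = EventKind::Poll;
  // small, common payload only
};

struct DnsEvent : BaseEvent {
  ip_addr_t addr;   // resolved address
  char name[64];    // looked-up hostname; buffer size is illustrative
};
```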