Cellular: Add check for network congestion in BC95 driver #12083
Conversation
@AriParkkila, thank you for your changes.

Will the driver generate a SIGIO event when the congestion is over? That's a necessary part of the contract for "would block". (A purely timed event would do, if the hardware gives you nothing, but you have to generate something.) Is this being aimed at UDP or TCP or both? Generating "would block" at all for UDP is weird, so I wouldn't recommend it unless it's solving a specific UDP problem. You may want to make this conditional on TCP.
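For illustration, a minimal sketch of the kind of timed event being asked for here, assuming an events::EventQueue is available to the stack and that the socket's sigio callback is stored as a callback/data pair. The names _event_queue, TXFULL_EVENT_TIMEOUT, report_tx_full and the CellularSocket fields are placeholders, not the PR's actual code:

```cpp
// Sketch only - placeholder names throughout. After returning "would
// block", queue a one-shot timed event that raises SIGIO so the caller
// knows to retry, even though the hardware gives no signal of its own.
nsapi_error_t report_tx_full(CellularSocket *socket)
{
    _event_queue->call_in(TXFULL_EVENT_TIMEOUT /* ms */, [socket]() {
        if (socket->_cb) {
            socket->_cb(socket->_data);   // SIGIO: congestion may have cleared
        }
    });
    return NSAPI_ERROR_WOULD_BLOCK;
}
```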
Force-pushed f83fe95 to d687c45
Added a timed callback event after NSAPI_ERROR_WOULD_BLOCK.
Timeout stuff looks correct, just can be simplified a bit.
What about UDP? Are you confident you need this to generate "would block" for that case? I'm nervous that this may cause problems because it's so unconventional - neither lwIP nor Nanostack would ever do so, and I can't think of any offboard driver that does.
They would normally just drop the packet either with no error or a "no memory" error.
I would only consider a "wouldblock" return for UDP if I was seeing real problems with 2 or 3 packets in a row consistently dropping the second - that behaviour may cause more application problems than returning the "would block".
Typical UDP behaviour that applications tend to assume is:
- UDP send never blocks, even if socket is blocking
- UDP will drop packets if congestion occurs
- You can send a few back-to-back packets without immediately causing congestion loss
Returning "would block" makes UDP send block, unless the socket is configured non-blocking - a send call could end up waiting indefinitely until the modem accepts the packet.
If I'd only consider violating 1 if the alternative was violating 3 a lot, and I was confident it wouldn't be blocking for long - we were just covering over some rapidly-clearing local buffer congestion.
There is a "DTLS" socket test somewhere in Greentea or IceTea that is exercising a typical DTLS-like 5-packet burst. If that test passes without this, I'd say make it conditional on TCP.
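For context, a sketch of the kind of traffic that test models - a DTLS-style flight of five datagrams fired back-to-back on a non-blocking socket, with no wouldblock handling at all. The peer address, port and datagram size here are invented:

```cpp
#include "mbed.h"

void dtls_like_burst(NetworkInterface *net)
{
    UDPSocket sock;
    sock.open(net);
    sock.set_blocking(false);
    SocketAddress dest("198.51.100.1", 5684);   // invented peer

    char datagram[120] = {0};
    for (int i = 0; i < 5; i++) {
        // Typical apps fire the whole flight and rely on the DTLS
        // retransmission timer for recovery, not the send return code.
        sock.sendto(dest, datagram, sizeof datagram);
    }
}
```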
device_err_t err = _at.get_last_device_error();
if (err.errType == DeviceErrorTypeErrorCME && (err.errCode == AT_UART_BUFFER_ERROR || err.errCode == AT_BACK_OFF_TIMER)) {
    uint8_t expect_false = false;
    if (core_util_atomic_cas_u8(&socket->txfull_event, &expect_false, true) && !_txfull_event_id) {
Not sure you need the atomic, but if you do, using if (!core_util_atomic_exchange_bool(&socket->txfull_event, true)) is simpler and smaller.
You may see that CAS form around, but it predates the addition of the "exchange" call and the "bool" form.
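Side by side, the two idioms being compared - a hedged sketch on standalone flags, not the PR's code; both set the flag and report whether it was previously clear:

```cpp
#include "mbed.h"

volatile uint8_t flag_u8 = false;
volatile bool flag = false;

void compare_idioms()
{
    // Older CAS idiom - predates core_util_atomic_exchange_bool():
    uint8_t expect_false = false;
    if (core_util_atomic_cas_u8(&flag_u8, &expect_false, true)) {
        // flag was clear and we just set it
    }

    // Equivalent, simpler and smaller - exchange returns the old value:
    if (!core_util_atomic_exchange_bool(&flag, true)) {
        // flag was clear and we just set it
    }
}
```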
@kjbracey-arm For example, the UDPSOCKET_ECHOTEST_BURST_NONBLOCK test handles NSAPI_ERROR_WOULD_BLOCK, so it should be an expected return value from UDP sockets too - or we need to fix the UDP socket tests as well.
@kjbracey-arm Fixed UDP to non-block and simplified atomic calls. Looks good?
That test code is rather unusual. Normal UDP applications will tend to not have special wouldblock handling. The handling as shown is also very suboptimal - causing a tight 100% CPU spin loop until the TX works. (I note it was you who added the test handling, so it seems like a slight case of writing drivers in a particular way and then adjusting tests to match that way, rather than writing drivers to match existing apps.) The same "wait until TX" effect would be better achieved by marking the socket blocking for the TX, and non-blocking for the RX - see the sketch after this comment.

Given the test change, is it the case that you have cellular drivers that do routinely report an error during that 5-packet TX burst, so fail it? The test seems to be aiming for 70% success - the burst should mostly work. Not achievable without a block/retry?

The reason for the test is to try to simulate what apps have been doing - in particular DTLS doing a 5-packet burst on a non-blocking socket. If the test becomes non-representative of apps, it no longer serves its purpose.

If your modem drivers can't pass that test directly, I'd like to solve it in a way invisible to applications. That could be modifying UDPSocket. Alternatively, your driver could do its own delay and retry, but the UDPSocket trick seems neater to me. Although on the other hand, it forces you to implement the timed SIGIO - a local "attempt send 3 times with a sleep(50) between each" would be more straightforward.

In both cases, even though the call is supposed to be "non-blocking", taking a small amount of time to retry (couple of 100ms?) is not out of line with the time you take to do any socket operation, due to the serial speed. Waiting for a local TX buffer to clear is not the same as waiting for a packet to arrive or waiting for TCP flow control - it's similar to any "wait for modem AT response".
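The blocking-TX/non-blocking-RX arrangement mentioned above might look like this in a test or app - a sketch under the assumption that the socket and destination already exist:

```cpp
// Sketch: the "wait until TX works" effect without a spin loop.
nsapi_size_or_error_t send_then_poll_rx(UDPSocket &sock, const SocketAddress &dest,
                                        const void *buf, nsapi_size_t len)
{
    sock.set_timeout(1000);                 // bounded blocking for the send
    nsapi_size_or_error_t sent = sock.sendto(dest, buf, len);
    sock.set_blocking(false);               // non-blocking again for the RX side
    return sent;
}
```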
I've thought about this a bit more. The "just block anyway in UDPSocket" won't fly - there could still be deadlock in the generation of SIGIO. If the caller's on the global event queue, the sigio event won't come.

I think I'm favouring the second solution - just synchronously retry a couple of times in the driver. That keeps the inherent simplicity of the offboard driver, not adding any extra context complexity. If you get a "busy" response, and it's UDP, sleep (maybe 20-50ms? check what is likely to clear it?), and retry, once or twice. I've already seen that pattern used in a few AT-type drivers.

As long as the modem's busy does clear within 100ms, that should be sufficient to cover the "lack of modem buffer" issue for normally-written non-blocking UDP apps.
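A sketch of that synchronous retry, as a fragment of the driver's send path. modem_send() is a stand-in for whatever AT-level send the driver already has, rtos::ThisThread::sleep_for is used as in Mbed OS 5, and the 2x/50 ms numbers would need tuning against what actually clears the modem's buffer:

```cpp
// UDP only: on "busy", sleep briefly and retry in place, so a bursty
// non-blocking app never sees the modem's shallow buffer.
nsapi_size_or_error_t sent = modem_send(socket, address, data, size);
for (int retry = 0; retry < 2 && sent == NSAPI_ERROR_WOULD_BLOCK
        && socket->proto == NSAPI_UDP; retry++) {
    rtos::ThisThread::sleep_for(50);   // ms - give the modem buffer time to drain
    sent = modem_send(socket, address, data, size);
}
return sent;
```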
@kjbracey-arm Do you mean that handling of NSAPI_ERROR_WOULD_BLOCK should be removed from the UDP socket tests?
Basically, yes, if the test is using it to trigger an immediate or timed UDP retry, as that's not representative of typical UDP application behaviour. If those checks needed to be added to get the modems to pass the tests, then those modems are likely to still cause problems in real applications. If we can pass the test with a synchronous retry or two in the driver, then that's ideal.

But be clear the driver retry is only for UDP - we assume (hopefully correctly) that a UDP "busy" is always going to be "rapidly clearing" in the modem itself, so we tolerate the delay just to cope with bursty non-blocking apps. For TCP a "busy" is an actual end-to-end flow control signal, so there's no reason to assume it will clear rapidly. Apps will also be assuming blocking or "wouldblock", so will be quite happy for you to immediately say "wouldblock" (as long as you SIGIO later).
@kjbracey-arm What is a proper error return value when UDP fails to send (still failing after retries)? The Mbed OS UDP socket tests must be updated to handle it instead of NSAPI_ERROR_WOULD_BLOCK.
I would say "no memory" (@SeppoTakalo, @VeijoPesonen - am I right?), but I would not expect any test to "handle" it specifically. UDP senders shouldn't normally bother checking returns from send, except perhaps "no socket" indicating the whole socket has died somehow. If the driver drops the packet, the packet is just treated as dropped (due to congestion) and it should be handled like any packet drop along the path - a timed retry in the app. Packets can be lost along the path with or without a visible error code, so checking for the error code doesn't help.

The basic way the Internet works at datagram level is to assume it's a mesh of Ethernet-type networks attached by routers. Packet loss (determined by lack of ack) is assumed to be due to congestion, and handled by retry with increasing backoff at the sender. Links are assumed to be >95% reliable when uncongested, with at least a few packets' worth of buffer space so 2 packets in a row don't cause immediate congestion - Ethernet-like. (IP datagrams are just abstracted Ethernet frames, tbh.)

If any particular link can't achieve Ethernet-like levels of reliability, then to preserve the apparent performance characteristics for applications it needs to do a certain amount of local link retry - e.g. 802.15.4 radios do 5 or so retries with MAC-level acks before giving up. Here it seems we may have a modem with an unusually low amount of UDP buffer (or none), so the driver has to cover for that. It could do some asynchronous buffering itself, but forcibly slowing the app with the synchronous retry is acceptable and simpler.

It's not really acceptable for the app to handle that sort of link-specific retry, because it doesn't know the characteristics of the link, so wouldn't know how many retries or what timing would be appropriate, or if on a given link the retries might make the problem worse. Apps should be conservative and assume any packet loss is congestion, so is a cue to slow down - just let the "no ack" retry code handle it, and ignore the send return code. (And all of that is what TCP does internally too - TCP doesn't have any handling for TX errors for its segments from the driver.)
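From the application side, the behaviour described above amounts to something like this - a sketch in which the socket, peer, request framing and timings are all invented, and the send return code is deliberately ignored:

```cpp
// App-level recovery: treat any loss as congestion, retry with backoff.
void request_with_backoff(UDPSocket &sock, const SocketAddress &dest,
                          const void *request, nsapi_size_t req_len)
{
    char reply[64];
    int backoff_ms = 500;
    for (int attempt = 0; attempt < 5; attempt++) {
        sock.sendto(dest, request, req_len);          // return code not checked
        sock.set_timeout(backoff_ms);                 // wait this long for an answer
        if (sock.recvfrom(NULL, reply, sizeof reply) > 0) {
            break;                                    // got a reply - done
        }
        backoff_ms *= 2;                              // assume congestion: slow down
    }
}
```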
Force-pushed d687c45 to 91a26f1
@kjbracey-arm could you please review the updates?
This looks correct to me now. But does it work? (That's quite a distinct thing from "looking correct to Kevin"...)
Only thing is that now I can see the atomic bits more clearly, there's a flaw there.
}
return NSAPI_ERROR_NO_MEMORY;
}
if (!core_util_atomic_exchange_bool(&socket->txfull_event, true) && !_txfull_event_id) {
If you need atomic handling of the txfull_event, you also will need atomic handling of the txfull_event_id. Indeed, that's the more important one, as the call_in has side-effects.

At this point, I'd say keep it simple by protecting both with the _socket_mutex. Any reason not to do that? Much as I like atomics, they're a pain to think about compared to mutexes.
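What that might look like - a sketch, not the PR's final code, assuming the event queue and _socket_mutex already available in AT_CellularStack, and a hypothetical txfull_event_timeout() handler:

```cpp
// Decide, queue, and record the event id all under one lock, since
// call_in's side-effect (queuing an event) must stay in sync with
// _txfull_event_id and the per-socket flag.
_socket_mutex.lock();
socket->txfull_event = true;                 // this socket wants a SIGIO later
if (!_txfull_event_id) {
    _txfull_event_id = _event_queue->call_in(TXFULL_EVENT_TIMEOUT,
            callback(this, &AT_CellularStack::txfull_event_timeout));
}
_socket_mutex.unlock();
```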
The atomic implementation is more or less copied from ESP8266Interface. I'm not aware of any call_in side-effects, but if the queue guarantees that a new event is always generated after call_in (also while executing the same event) then additional locking is not needed here. However, if you think a mutex is safer for some reason I can change the code.
Discussed offline - short answer is that multiple call_ins are fine; it's that this could lose track of what it has queued, so the clean-up attempt in the destructor becomes ineffective. (Plus the unprotected multithread access to the variable is formally undefined.)
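The destructor concern, concretely - a hypothetical clean-up fragment that only works if _txfull_event_id reliably reflects what was queued:

```cpp
// Inside the stack's destructor. If _txfull_event_id was updated racily,
// this cancel can miss a still-queued event, which would then fire into
// a destroyed object.
_socket_mutex.lock();
if (_txfull_event_id) {
    _event_queue->cancel(_txfull_event_id);
    _txfull_event_id = 0;
}
_socket_mutex.unlock();
```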
Force-pushed 91a26f1 to d5abab8
@kjbracey-arm Please review - changed atomic to mutex, and UDP to return NO_MEMORY instead of WOULDBLOCK.
@@ -100,7 +100,9 @@ void UDPSOCKET_ECHOTEST_BURST()
if (check_oversized_packets(sent, tx_buffers[x].len)) {
    TEST_IGNORE_MESSAGE("This device does not handle oversized packets");
}
TEST_ASSERT_EQUAL(tx_buffers[x].len, sent);
if (sent != NSAPI_ERROR_NO_MEMORY) {
I think I'd be even more relaxed on this - I'd let it return any error, or the size sent.
But this is fine, as it's relaxing it somewhat.
(The code handling oversized packets seems wrong to me, but that's another issue).
@@ -132,6 +133,7 @@ class AT_CellularStack : public NetworkStack {
bool tx_ready; // socket is ready for sending on modem stack
bool tls_socket; // socket uses modem's internal TLS socket functionality
nsapi_size_t pending_bytes; // The number of received bytes pending
bool txfull_event; // socket event after wouldblock
You could save RAM by putting this bool next to connected.
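The padding argument, roughly - an illustrative layout, not the real class; on a 32-bit target each stray bool after a 4-byte member can cost a full word:

```cpp
#include "nsapi_types.h"

struct SocketFieldsSketch {
    bool connected;        // grouping the bools...
    bool tx_ready;
    bool tls_socket;
    bool txfull_event;     // ...lets all four share one 4-byte word
    nsapi_size_t pending_bytes;
};
```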
CI started
Test run: SUCCESS. Summary: 11 of 11 test jobs passed.
Summary of changes
Add a check for network congestion in the BC95 driver, to return WOULDBLOCK on socket send.