
Cellular: Add check for network congestion in BC95 driver #12083


Merged

2 commits merged into ARMmbed:master on Dec 23, 2019

Conversation

AriParkkila

Summary of changes

Add a check for network congestion in the BC95 driver to return WOULD_BLOCK on socket send.

Impact of changes

Migration actions required

Documentation


Pull request type

[x] Patch update (Bug fix / Target update / Docs update / Test update / Refactor)
[ ] Feature update (New feature / Functionality change / New API)
[ ] Major update (Breaking change E.g. Return code change / API behaviour change)

Test results

[x] No Tests required for this change (E.g. docs only update)
[ ] Covered by existing mbed-os tests (Greentea or Unittest)
[ ] Tests / results supplied as part of this PR

Reviewers


@ciarmcom requested review from a team on December 11, 2019 14:00
@ciarmcom
Member

@AriParkkila, thank you for your changes.
@ARMmbed/mbed-os-wan @ARMmbed/mbed-os-maintainers please review.

@kjbracey
Contributor

Will the driver generate a SIGIO event when the congestion is over? That's a necessary part of the contract for "would block".

(A purely timed event would do, if the hardware gives you nothing, but you have to generate something.)

Is this being aimed at UDP or TCP or both? Generating "would block" at all for UDP is weird, so I wouldn't recommend it unless it's solving a specific UDP problem. You may want to make this conditional on TCP.
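
For illustration, a minimal sketch of the kind of timed SIGIO this contract implies, assuming the shared event queue; this is not the PR's actual code, the function and callback names are invented, and the delay is arbitrary:

// Hedged sketch: after returning WOULD_BLOCK the driver must guarantee a later
// SIGIO so a blocked sender wakes up; a timed event on the shared queue is one
// way to do that if the modem gives no "congestion cleared" indication.
#include "events/mbed_shared_queues.h"
#include "nsapi_types.h"
#include "platform/Callback.h"

nsapi_size_or_error_t report_congestion_with_timed_sigio(mbed::Callback<void()> socket_sigio)
{
    mbed_event_queue()->call_in(1000, socket_sigio);  // re-signal the socket later
    return NSAPI_ERROR_WOULD_BLOCK;
}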

@AriParkkila
Author

Added a timed callback event after NSAPI_ERROR_WOULD_BLOCK on both TCP and UDP.

Contributor

@kjbracey left a comment


Timeout stuff looks correct, just can be simplified a bit.

What about UDP? Are you confident you need this to generate "would block" for that case? I'm nervous that this may cause problems because it's so unconventional - neither lwIP nor Nanostack would ever do so, and I can't think of any offboard driver that does.

They would normally just drop the packet either with no error or a "no memory" error.

I would only consider a "wouldblock" return for UDP if I was seeing real problems with 2 or 3 packets in a row consistently dropping the second - that behaviour may cause more application problems than returning the "would block".

Typical UDP behaviour that applications tend to assume is:

  1. UDP send never blocks, even if socket is blocking
  2. UDP will drop packets if congestion occurs
  3. You can send a few back-to-back packets without immediately causing congestion loss

Returning "would block" makes UDP send block, unless the socket is configured non-blocking - a send call could end up waiting indefinitely until the modem accepts the packet.

I'd only consider violating 1 if the alternative was violating 3 a lot, and I was confident it wouldn't be blocking for long - we were just covering over some rapidly-clearing local buffer congestion.

There is a "DTLS" socket test somewhere in Greentea or IceTea that is exercising a typical DTLS-like 5-packet burst. If that test passes without this, I'd say make it conditional on TCP.

device_err_t err = _at.get_last_device_error();
if (err.errType == DeviceErrorTypeErrorCME && (err.errCode == AT_UART_BUFFER_ERROR || err.errCode == AT_BACK_OFF_TIMER)) {
    uint8_t expect_false = false;
    if (core_util_atomic_cas_u8(&socket->txfull_event, &expect_false, true) && !_txfull_event_id) {
Contributor

@kjbracey Dec 16, 2019


Not sure you need the atomic, but if you do, using if (!core_util_atomic_exchange_bool(&socket->txfull_event, true)) is simpler and smaller.

You may see that CAS form around, but it predates the addition of the "exchange" call and the "bool" form.
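
For illustration, the two forms side by side on standalone flags (simplified; not the PR's code):

// Hedged sketch: both forms set the flag and report whether it was previously
// clear; the exchange form is the shorter equivalent.
#include "platform/mbed_atomic.h"

static volatile uint8_t txfull_u8;    // the CAS form operates on uint8_t
static volatile bool txfull_flag;     // the exchange form operates on bool

bool claim_with_cas()
{
    uint8_t expect_false = false;
    return core_util_atomic_cas_u8(&txfull_u8, &expect_false, true);
}

bool claim_with_exchange()
{
    return !core_util_atomic_exchange_bool(&txfull_flag, true);
}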

Author


@kjbracey-arm For example, the UDPSOCKET_ECHOTEST_BURST_NONBLOCK test handles NSAPI_ERROR_WOULD_BLOCK, so it should be an expected return value from UDP sockets as well, or we need to fix the UDP socket tests too.

Author


@kjbracey-arm Fixed UDP to non-block and simplified atomic calls. Looks good?

@kjbracey
Contributor

kjbracey commented Dec 16, 2019

> @kjbracey-arm For example, the UDPSOCKET_ECHOTEST_BURST_NONBLOCK test handles NSAPI_ERROR_WOULD_BLOCK, so it should be an expected return value from UDP sockets as well, or we need to fix the UDP socket tests too.

That test code is rather unusual. Normal UDP applications will tend to not have special wouldblock handling. The handling as shown is also very suboptimal - causing a tight 100% CPU spin loop until the TX works. (I note it was you who added the test handling, so it seems like a slight case of writing drivers in a particular way and then adjusting tests to match that way, rather than writing drivers to match existing apps).

The same "wait until TX" effect would be better achieved by marking the socket blocking for the TX, and non-blocking for the RX.

Given the test change, is it the case that you have cellular drivers that do routinely report an error during that 5-packet TX burst, so fail it? The test seems to be aiming for 70% success - the burst should mostly work. Not achievable without a block/retry?

The reason for the test is to try to simulate what apps have been doing - in particular DTLS doing a 5-packet burst on a non-blocking socket. If the test becomes non-representative of apps, it no longer serves its purpose.

If your modem drivers can't pass that test directly, I'd like to solve it in a way invisible to applications. That could be modifying UDPSocket itself to ignore the "non-blocking" flag on send, and always act as blocking. Sounds weird, but makes sense - normal drivers and stacks will never return "wouldblock", so that would have no effect for them. But it would have an effect for offboard-modem-type drivers returning it just to make themselves not fail this test. The assumption would be that any UDP driver would only be returning "wouldblock" very briefly, and it wouldn't be subject to deadlock in event loop context. I think that assumption is sound - we're always talking about offboard stacks doing it, I think.

Alternatively, your driver could do its own delay and retry, but the UDPSocket trick seems neater to me. Although on the other hand, it forces you to implement the timed SIGIO - a local "attempt send 3 times with a sleep(50) between each" would be more straightforward.

In both cases, even though the call is supposed to be "non-blocking", taking a small amount of time to retry (couple of 100ms?) is not out-of-line with the time you take to do any socket operation, due to the serial speed. Waiting for a local TX buffer to clear is not the same as waiting for a packet to arrive or waiting for TCP flow control - it's similar to any "wait for modem AT response".
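
A rough sketch of the "blocking for the TX, non-blocking for the RX" pattern mentioned above, seen from a test's or app's point of view (names, sizes and the missing error handling are illustrative only):

// Hedged sketch: send the burst on a blocking socket so brief congestion is
// waited out inside sendto(), then switch to non-blocking to poll for replies.
#include "mbed.h"

void burst_send_then_poll(NetworkInterface *net, const SocketAddress &dest)
{
    UDPSocket sock;
    sock.open(net);

    sock.set_blocking(true);            // TX: let sendto() ride out congestion
    char payload[64] = {};
    for (int i = 0; i < 5; i++) {
        sock.sendto(dest, payload, sizeof(payload));
    }

    sock.set_blocking(false);           // RX: poll without hanging the caller
    char rx[64];
    sock.recvfrom(NULL, rx, sizeof(rx));
    sock.close();
}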

@kjbracey
Contributor

I've thought about this a bit more. The "just block anyway in UDPSocket" won't fly - there could still be deadlock in the generation of SIGIO. If the caller's on the global event queue, the sigio event won't come.

I think I'm favouring the second solution - just synchronously retry a couple of times in the driver. That keeps the inherent simplicity of the offboard driver, not adding any extra context complexity.

If you get a "busy" response, and it's UDP, sleep (maybe 20-50ms? check what is likely to clear it?), and retry, once or twice. I've already seen that pattern used in a few AT-type drivers.

As long as the modem's busy does clear within 100ms, that should be sufficient to cover the "lack of modem buffer" issue for normally-written non-blocking UDP apps.
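
A minimal sketch of that synchronous retry, with an invented try_modem_send() standing in for the driver's AT send path, and guessed delay and retry counts:

// Hedged sketch: for UDP only, retry a "busy" modem a couple of times with a
// short sleep, then give up and report the packet as dropped.
#include "mbed.h"

nsapi_size_or_error_t udp_send_with_retry(mbed::Callback<nsapi_size_or_error_t()> try_modem_send)
{
    for (int attempt = 0; attempt < 3; attempt++) {
        nsapi_size_or_error_t ret = try_modem_send();
        if (ret != NSAPI_ERROR_WOULD_BLOCK) {
            return ret;                      // sent (byte count) or a hard error
        }
        rtos::ThisThread::sleep_for(50);     // assume the modem buffer clears quickly
    }
    return NSAPI_ERROR_NO_MEMORY;            // still busy: treat as a dropped packet
}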

@AriParkkila
Author

@kjbracey-arm Do you mean that handling of NSAPI_ERROR_WOULD_BLOCK should be removed from netsocket/udp/ tests completely?

@kjbracey
Contributor

> @kjbracey-arm Do you mean that handling of NSAPI_ERROR_WOULD_BLOCK should be removed from netsocket/udp/ tests completely?

Basically, yes, if the test is using it to trigger an immediate or timed UDP retry, as that's not representative of typical UDP application behaviour.

If those checks needed to be added to get the modems to pass the tests, then those modems are likely to still cause problems in real applications.

If we can pass the test with a synchronous retry or two in the driver, then that's ideal.

But be clear the driver retry is only for UDP - we assume (hopefully correctly) that a UDP "busy" is always going to be "rapidly clearing" in the modem itself, so we tolerate the delay just to cope with bursty non-blocking apps.

For TCP a "busy" is an actual end-to-end flow control signal, so there's no reason to assume it will clear rapidly. Apps will also be assuming blocking or "wouldblock", so will be quite happy for you to immediately say "wouldblock" (as long as you SIGIO later).

@AriParkkila
Author

@kjbracey-arm What is a proper error return value when UDP fails to send (even after retries)? Mbed OS UDP socket tests must be updated to handle it instead of NSAPI_ERROR_WOULD_BLOCK.

@kjbracey
Contributor

kjbracey commented Dec 18, 2019

> @kjbracey-arm What is a proper error return value when UDP fails to send (even after retries)? Mbed OS UDP socket tests must be updated to handle it instead of NSAPI_ERROR_WOULD_BLOCK.

I would say "no memory" (@SeppoTakalo, @VeijoPesonen - am I right?), but I would not expect any test to "handle" it specifically. UDP senders shouldn't normally bother checking returns from send, except perhaps "no socket" indicating the whole socket has died somehow.

If the driver drops the packet, the packet is just treated as dropped (due to congestion) and it should be handled like any packet drop along the path - a timed retry in the app. Packets can be lost along the path with or without visible error code, so checking for the error code doesn't help.

The basic way the Internet works at the datagram level is to assume it's a mesh of Ethernet-type networks attached by routers. Packet loss (determined by lack of ack) is assumed to be due to congestion, and handled by retry with increasing backoff at the sender. Links are assumed to be >95% reliable when uncongested, with at least a few packets' worth of buffer space, so 2 packets in a row doesn't cause immediate congestion - Ethernet-like. (IP datagrams are just abstracted Ethernet frames tbh)

If any particular link can't achieve Ethernet-like levels of reliability, then to preserve the apparent performance characteristics for applications they need to do a certain amount of local link retry - eg 802.15.4 radios do 5 or so retries with MAC-level acks before giving up.

Here it seems we may have a modem with an unusually low amount of UDP buffer (or none), so the driver has to cover for that. It could do some asynchronous buffering itself, but forcibly slowing the app with the synchronous retry is acceptable and simpler.

It's not really acceptable for the app to handle that sort of link-specific retry, because it doesn't know the characteristics of the link, so wouldn't know how many retries or what timing would be appropriate, or if on a given link the retries might make the problem worse. Apps should be conservative and assume any packet loss is congestion, so is a cue to slow down - just let the "no ack" retry code handle it, and ignore the send return code.

(And all of that is what TCP does internally too - TCP doesn't have any handling for TX errors for its segments from the driver).
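
For illustration, a sketch of the application-side pattern described above: ignore the send return code and rely on a reply timeout with increasing back-off (names and timings are invented):

// Hedged sketch: the app treats any missing reply as congestion and backs off;
// it does not special-case the return value of sendto().
#include "mbed.h"

bool request_with_backoff(UDPSocket &sock, const SocketAddress &dest,
                          const void *request, nsapi_size_t len)
{
    uint32_t backoff_ms = 500;
    char reply[32];
    for (int attempt = 0; attempt < 4; attempt++) {
        sock.sendto(dest, request, len);    // return code deliberately ignored
        sock.set_timeout(backoff_ms);       // wait this long for an answer
        if (sock.recvfrom(NULL, reply, sizeof(reply)) > 0) {
            return true;                    // got a reply
        }
        backoff_ms *= 2;                    // assume loss means congestion: slow down
    }
    return false;
}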

@adbridge
Contributor

@kjbracey-arm could you please review the updates?

Contributor

@kjbracey left a comment


This looks correct to me now. But does it work? (That's quite a distinct thing from "looking correct to Kevin"...)

Only thing is that, now I can see the atomic bits more clearly, there's a flaw there.

    }
    return NSAPI_ERROR_NO_MEMORY;
}
if (!core_util_atomic_exchange_bool(&socket->txfull_event, true) && !_txfull_event_id) {
Contributor


If you need atomic handling of the txfull_event, you also will need atomic handling of the txfull_event_id.

Indeed, that's the more important one, as the call_in has side-effects.

At this point, I'd say keep it simple by protecting both with the _socket_mutex. Any reason not to do that? Much as I like atomics, they're a pain to think about compared to mutexes.
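
A rough sketch of the mutex-protected variant being suggested here; the class and its simplifications are illustrative, not the PR's code:

// Hedged sketch: take the same mutex around both the flag and the queued event
// id, so the destructor's cancel() always sees what was actually queued.
#include "mbed.h"
#include "events/mbed_shared_queues.h"

class TxFullNotifier {
public:
    void schedule(mbed::Callback<void()> sigio_cb)
    {
        _socket_mutex.lock();
        if (!_txfull_event && !_txfull_event_id) {
            // call_in() returns 0 on failure, so a non-zero id means "queued"
            _txfull_event_id = mbed_event_queue()->call_in(1000, sigio_cb);
            _txfull_event = (_txfull_event_id != 0);
        }
        _socket_mutex.unlock();
    }

    ~TxFullNotifier()
    {
        _socket_mutex.lock();
        if (_txfull_event_id) {
            mbed_event_queue()->cancel(_txfull_event_id);
        }
        _socket_mutex.unlock();
    }

private:
    rtos::Mutex _socket_mutex;
    bool _txfull_event = false;
    int _txfull_event_id = 0;
};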

Author


The atomic implementation is more or less a copy from ESP8266Interface. I'm not aware of any call_in side effects, but if the queue guarantees that a new event is always generated after call_in (also while executing the same event), then additional locking is not needed here. However, if you think a mutex is safer for some reason I can change the code.

Contributor


Discussed offline - short answer is that multiple call_ins are fine; it's that this could lose track of what it has queued, so the clean-up attempt in the destructor becomes ineffective.

(Plus the unprotected multithread access to the variable is formally undefined).

@AriParkkila
Author

@kjbracey-arm Please review, changed atomic to mutex and UDP to return NO_MEMORY instead of WOULDBLOCK.

@@ -100,7 +100,9 @@ void UDPSOCKET_ECHOTEST_BURST()
if (check_oversized_packets(sent, tx_buffers[x].len)) {
    TEST_IGNORE_MESSAGE("This device does not handle oversized packets");
}
TEST_ASSERT_EQUAL(tx_buffers[x].len, sent);
if (sent != NSAPI_ERROR_NO_MEMORY) {
Contributor


I think I'd be even more relaxed on this - I'd let it return any error, or the size sent.

But this is fine, as it's relaxing it somewhat.

(The code handling oversized packets seems wrong to me, but that's another issue).

@@ -132,6 +133,7 @@ class AT_CellularStack : public NetworkStack {
    bool tx_ready;              // socket is ready for sending on modem stack
    bool tls_socket;            // socket uses modem's internal TLS socket functionality
    nsapi_size_t pending_bytes; // The number of received bytes pending
    bool txfull_event;          // socket event after wouldblock
Contributor

@kjbracey Dec 20, 2019


You could save RAM by putting this bool next to connected.
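
An illustration of the RAM point; the layouts shown assume a typical 32-bit Cortex-M target and are not guaranteed by the standard:

// Hedged sketch: keeping the bools adjacent avoids the padding the compiler
// inserts before each 4-byte-aligned member.
#include <cstdint>

struct SocketScattered {      // bool, 4-byte member, bool
    bool connected;           // 1 byte + 3 bytes padding
    uint32_t pending_bytes;   // 4 bytes
    bool txfull_event;        // 1 byte + 3 bytes padding  -> sizeof == 12
};

struct SocketPacked {         // bools grouped together
    bool connected;           // 1 byte
    bool txfull_event;        // 1 byte + 2 bytes padding
    uint32_t pending_bytes;   // 4 bytes                   -> sizeof == 8
};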

@kjbracey
Contributor

CI started

@mbed-ci

mbed-ci commented Dec 23, 2019

Test run: SUCCESS

Summary: 11 of 11 test jobs passed
Build number: 1
Build artifacts

@0xc0170 removed the needs: CI label Dec 23, 2019
@0xc0170 added the release-version: 6.0.0-alpha-1 First pre-release version of 6.0.0 label Dec 23, 2019
@0xc0170 merged commit 1623b14 into ARMmbed:master Dec 23, 2019
Labels
release-version: 6.0.0-alpha-1 First pre-release version of 6.0.0
8 participants