
Cellular: Add check for network congestion in BC95 driver #12083


Merged

2 commits merged into ARMmbed:master on Dec 23, 2019

Conversation

AriParkkila

Summary of changes

Add a check for network congestion in the BC95 driver to return WOULD_BLOCK on socket send.

Impact of changes

Migration actions required

Documentation


Pull request type

[x] Patch update (Bug fix / Target update / Docs update / Test update / Refactor)
[ ] Feature update (New feature / Functionality change / New API)
[ ] Major update (Breaking change E.g. Return code change / API behaviour change)

Test results

[x] No Tests required for this change (E.g. docs only update)
[ ] Covered by existing mbed-os tests (Greentea or Unittest)
[ ] Tests / results supplied as part of this PR

Reviewers


@ciarmcom requested review from a team on December 11, 2019 14:00
@ciarmcom
Member

@AriParkkila, thank you for your changes.
@ARMmbed/mbed-os-wan @ARMmbed/mbed-os-maintainers please review.

@kjbracey
Contributor

Will the driver generate a SIGIO event when the congestion is over? That's a necessary part of the contract for "would block".

(A purely timed event would do, if the hardware gives you nothing, but you have to generate something.)

Is this being aimed at UDP or TCP or both? Generating "would block" at all for UDP is weird, so I wouldn't recommend it unless it's solving a specific UDP problem. You may want to make this conditional on TCP.
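
For illustration, a minimal sketch of the kind of timed SIGIO this contract implies, assuming the shared event queue; this is not the PR's actual code, the function and callback names are invented, and the delay is arbitrary:

// Hedged sketch: after returning WOULD_BLOCK the driver must guarantee a later
// SIGIO so a blocked sender wakes up; a timed event on the shared queue is one
// way to do that if the modem gives no "congestion cleared" indication.
#include "events/mbed_shared_queues.h"
#include "nsapi_types.h"
#include "platform/Callback.h"

nsapi_size_or_error_t report_congestion_with_timed_sigio(mbed::Callback<void()> socket_sigio)
{
    mbed_event_queue()->call_in(1000, socket_sigio);  // re-signal the socket later
    return NSAPI_ERROR_WOULD_BLOCK;
}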

@AriParkkila
Author

Added a timed callback event after NSAPI_ERROR_WOULD_BLOCK on both TCP and UDP.

Contributor

@kjbracey left a comment


Timeout stuff looks correct, just can be simplified a bit.

What about UDP? Are you confident you need this to generate "would block" for that case? I'm nervous that this may cause problems because it's so unconventional - neither lwIP nor Nanostack would ever do so, and I can't think of any offboard driver that does.

They would normally just drop the packet either with no error or a "no memory" error.

I would only consider a "wouldblock" return for UDP if I was seeing real problems with 2 or 3 packets in a row consistently dropping the second - that behaviour may cause more application problems than returning the "would block".

Typical UDP behaviour that applications tend to assume is:

  1. UDP send never blocks, even if socket is blocking
  2. UDP will drop packets if congestion occurs
  3. You can send a few back-to-back packets without immediately causing congestion loss

Returning "would block" makes UDP send block, unless the socket is configured non-blocking - a send call could end up waiting indefinitely until the modem accepts the packet.

I'd only consider violating 1 if the alternative was violating 3 a lot, and I was confident it wouldn't be blocking for long - we were just covering over some rapidly-clearing local buffer congestion.

There is a "DTLS" socket test somewhere in Greentea or IceTea that is exercising a typical DTLS-like 5-packet burst. If that test passes without this, I'd say make it conditional on TCP.

device_err_t err = _at.get_last_device_error();
if (err.errType == DeviceErrorTypeErrorCME && (err.errCode == AT_UART_BUFFER_ERROR || err.errCode == AT_BACK_OFF_TIMER)) {
    uint8_t expect_false = false;
    if (core_util_atomic_cas_u8(&socket->txfull_event, &expect_false, true) && !_txfull_event_id) {
Contributor

@kjbracey Dec 16, 2019


Not sure you need the atomic, but if you do, using if (!core_util_atomic_exchange_bool(&socket->txfull_event, true)) is simpler and smaller.

You may see that CAS form around, but it predates the addition of the "exchange" call and the "bool" form.
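
For illustration, the two forms side by side on standalone flags (simplified; not the PR's code):

// Hedged sketch: both forms set the flag and report whether it was previously
// clear; the exchange form is the shorter equivalent.
#include "platform/mbed_atomic.h"

static volatile uint8_t txfull_u8;    // the CAS form operates on uint8_t
static volatile bool txfull_flag;     // the exchange form operates on bool

bool claim_with_cas()
{
    uint8_t expect_false = false;
    return core_util_atomic_cas_u8(&txfull_u8, &expect_false, true);
}

bool claim_with_exchange()
{
    return !core_util_atomic_exchange_bool(&txfull_flag, true);
}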

Author


@kjbracey-arm For example, the UDPSOCKET_ECHOTEST_BURST_NONBLOCK test handles NSAPI_ERROR_WOULD_BLOCK, so it should be an expected return value from UDP sockets as well, or we need to fix the UDP socket tests too.

Author


@kjbracey-arm Fixed UDP to non-block and simplified atomic calls. Looks good?

@kjbracey
Contributor

kjbracey commented Dec 16, 2019

> @kjbracey-arm For example, the UDPSOCKET_ECHOTEST_BURST_NONBLOCK test handles NSAPI_ERROR_WOULD_BLOCK, so it should be an expected return value from UDP sockets as well, or we need to fix the UDP socket tests too.

That test code is rather unusual. Normal UDP applications will tend to not have special wouldblock handling. The handling as shown is also very suboptimal - causing a tight 100% CPU spin loop until the TX works. (I note it was you who added the test handling, so it seems like a slight case of writing drivers in a particular way and then adjusting tests to match that way, rather than writing drivers to match existing apps).

The same "wait until TX" effect would be better achieved by marking the socket blocking for the TX, and non-blocking for the RX.

Given the test change, is it the case that you have cellular drivers that do routinely report an error during that 5-packet TX burst, so fail it? The test seems to be aiming for 70% success - the burst should mostly work. Not achievable without a block/retry?

The reason for the test is to try to simulate what apps have been doing - in particular DTLS doing a 5-packet burst on a non-blocking socket. If the test becomes non-representative of apps, it no longer serves its purpose.

If your modem drivers can't pass that test directly, I'd like to solve it in a way invisible to applications. That could be modifying UDPSocket itself to ignore the "non-blocking" flag on send, and always act as blocking. Sounds weird, but makes sense - normal drivers and stacks will never return "wouldblock", so that would have no effect for them. But it would have an effect for offboard-modem-type drivers returning it just to make themselves not fail this test. The assumption would be that any UDP driver would only be returning "wouldblock" very briefly, and it wouldn't be subject to deadlock in event loop context. I think that assumption is sound - we're always talking about offboard stacks doing it, I think.

Alternatively, your driver could do its own delay and retry, but the UDPSocket trick seems neater to me. Although on the other hand, it forces you to implement the timed SIGIO - a local "attempt send 3 times with a sleep(50) between each" would be more straightforward.

In both cases, even though the call is supposed to be "non-blocking", taking a small amount of time to retry (couple of 100ms?) is not out-of-line with the time you take to do any socket operation, due to the serial speed. Waiting for a local TX buffer to clear is not the same as waiting for a packet to arrive or waiting for TCP flow control - it's similar to any "wait for modem AT response".
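
A rough sketch of the "blocking for the TX, non-blocking for the RX" pattern mentioned above, seen from a test's or app's point of view (names, sizes and the missing error handling are illustrative only):

// Hedged sketch: send the burst on a blocking socket so brief congestion is
// waited out inside sendto(), then switch to non-blocking to poll for replies.
#include "mbed.h"

void burst_send_then_poll(NetworkInterface *net, const SocketAddress &dest)
{
    UDPSocket sock;
    sock.open(net);

    sock.set_blocking(true);            // TX: let sendto() ride out congestion
    char payload[64] = {};
    for (int i = 0; i < 5; i++) {
        sock.sendto(dest, payload, sizeof(payload));
    }

    sock.set_blocking(false);           // RX: poll without hanging the caller
    char rx[64];
    sock.recvfrom(NULL, rx, sizeof(rx));
    sock.close();
}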

@kjbracey
Contributor

I've thought about this a bit more. The "just block anyway in UDPSocket" won't fly - there could still be deadlock in the generation of SIGIO. If the caller's on the global event queue, the sigio event won't come.

I think I'm favouring the second solution - just synchronously retry a couple of times in the driver. That keeps the inherent simplicity of the offboard driver, not adding any extra context complexity.

If you get a "busy" response, and it's UDP, sleep (maybe 20-50ms? check what is likely to clear it?), and retry, once or twice. I've already seen that pattern used in a few AT-type drivers.

As long as the modem's busy does clear within 100ms, that should be sufficient to cover the "lack of modem buffer" issue for normally-written non-blocking UDP apps.
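
A minimal sketch of that synchronous retry, with an invented try_modem_send() standing in for the driver's AT send path, and guessed delay and retry counts:

// Hedged sketch: for UDP only, retry a "busy" modem a couple of times with a
// short sleep, then give up and report the packet as dropped.
#include "mbed.h"

nsapi_size_or_error_t udp_send_with_retry(mbed::Callback<nsapi_size_or_error_t()> try_modem_send)
{
    for (int attempt = 0; attempt < 3; attempt++) {
        nsapi_size_or_error_t ret = try_modem_send();
        if (ret != NSAPI_ERROR_WOULD_BLOCK) {
            return ret;                      // sent (byte count) or a hard error
        }
        rtos::ThisThread::sleep_for(50);     // assume the modem buffer clears quickly
    }
    return NSAPI_ERROR_NO_MEMORY;            // still busy: treat as a dropped packet
}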

@AriParkkila
Author

@kjbracey-arm Do you mean that handling of NSAPI_ERROR_WOULD_BLOCK should be removed from netsocket/udp/ tests completely?

@kjbracey
Contributor

> @kjbracey-arm Do you mean that handling of NSAPI_ERROR_WOULD_BLOCK should be removed from netsocket/udp/ tests completely?

Basically, yes, if the test is using it to trigger an immediate or timed UDP retry, as that's not representative of typical UDP application behaviour.

If those checks needed to be added to get the modems to pass the tests, then those modems are likely to still cause problems in real applications.

If we can pass the test with a synchronous retry or two in the driver, then that's ideal.

But be clear the driver retry is only for UDP - we assume (hopefully correctly) that a UDP "busy" is always going to be "rapidly clearing" in the modem itself, so we tolerate the delay just to cope with bursty non-blocking apps.

For TCP a "busy" is an actual end-to-end flow control signal, so there's no reason to assume it will clear rapidly. Apps will also be assuming blocking or "wouldblock", so will be quite happy for you to immediately say "wouldblock" (as long as you SIGIO later).

@AriParkkila
Author

@kjbracey-arm What is a proper error return value when UDP fails to send (even after retries)? Mbed OS UDP socket tests must be updated to handle it instead of NSAPI_ERROR_WOULD_BLOCK.

@kjbracey
Contributor

kjbracey commented Dec 18, 2019

> @kjbracey-arm What is a proper error return value when UDP fails to send (even after retries)? Mbed OS UDP socket tests must be updated to handle it instead of NSAPI_ERROR_WOULD_BLOCK.

I would say "no memory" (@SeppoTakalo, @VeijoPesonen - am I right?), but I would not expect any test to "handle" it specifically. UDP senders shouldn't normally bother checking returns from send, except perhaps "no socket" indicating the whole socket has died somehow.

If the driver drops the packet, the packet is just treated as dropped (due to congestion) and it should be handled like any packet drop along the path - a timed retry in the app. Packets can be lost along the path with or without visible error code, so checking for the error code doesn't help.

The basic way the Internet works at the datagram level is to assume it's a mesh of Ethernet-type networks attached by routers. Packet loss (determined by lack of ack) is assumed to be due to congestion, and handled by retry with increasing backoff at the sender. Links are assumed to be >95% reliable when uncongested, with at least a few packets' worth of buffer space, so 2 packets in a row doesn't cause immediate congestion - Ethernet-like. (IP datagrams are just abstracted Ethernet frames tbh)

If any particular link can't achieve Ethernet-like levels of reliability, then to preserve the apparent performance characteristics for applications they need to do a certain amount of local link retry - eg 802.15.4 radios do 5 or so retries with MAC-level acks before giving up.

Here it seems we may have a modem with an unusually low amount of UDP buffer (or none), so the driver has to cover for that. It could do some asynchronous buffering itself, but forcibly slowing the app with the synchronous retry is acceptable and simpler.

It's not really acceptable for the app to handle that sort of link-specific retry, because it doesn't know the characteristics of the link, so wouldn't know how many retries or what timing would be appropriate, or if on a given link the retries might make the problem worse. Apps should be conservative and assume any packet loss is congestion, so is a cue to slow down - just let the "no ack" retry code handle it, and ignore the send return code.

(And all of that is what TCP does internally too - TCP doesn't have any handling for TX errors for its segments from the driver).
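
For illustration, a sketch of the application-side pattern described above: ignore the send return code and rely on a reply timeout with increasing back-off (names and timings are invented):

// Hedged sketch: the app treats any missing reply as congestion and backs off;
// it does not special-case the return value of sendto().
#include "mbed.h"

bool request_with_backoff(UDPSocket &sock, const SocketAddress &dest,
                          const void *request, nsapi_size_t len)
{
    uint32_t backoff_ms = 500;
    char reply[32];
    for (int attempt = 0; attempt < 4; attempt++) {
        sock.sendto(dest, request, len);    // return code deliberately ignored
        sock.set_timeout(backoff_ms);       // wait this long for an answer
        if (sock.recvfrom(NULL, reply, sizeof(reply)) > 0) {
            return true;                    // got a reply
        }
        backoff_ms *= 2;                    // assume loss means congestion: slow down
    }
    return false;
}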

@adbridge
Contributor

@kjbracey-arm could you please review the updates?

Contributor

@kjbracey left a comment


This looks correct to me now. But does it work? (That's quite a distinct thing from "looking correct to Kevin"...)

Only thing is that, now I can see the atomic bits more clearly, there's a flaw there.

    }
    return NSAPI_ERROR_NO_MEMORY;
}
if (!core_util_atomic_exchange_bool(&socket->txfull_event, true) && !_txfull_event_id) {
Contributor


If you need atomic handling of the txfull_event, you also will need atomic handling of the txfull_event_id.

Indeed, that's the more important one, as the call_in has side-effects.

At this point, I'd say keep it simple by protecting both with the _socket_mutex. Any reason not to do that? Much as I like atomics, they're a pain to think about compared to mutexes.
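
A rough sketch of the mutex-protected variant being suggested here; the class and its simplifications are illustrative, not the PR's code:

// Hedged sketch: take the same mutex around both the flag and the queued event
// id, so the destructor's cancel() always sees what was actually queued.
#include "mbed.h"
#include "events/mbed_shared_queues.h"

class TxFullNotifier {
public:
    void schedule(mbed::Callback<void()> sigio_cb)
    {
        _socket_mutex.lock();
        if (!_txfull_event && !_txfull_event_id) {
            // call_in() returns 0 on failure, so a non-zero id means "queued"
            _txfull_event_id = mbed_event_queue()->call_in(1000, sigio_cb);
            _txfull_event = (_txfull_event_id != 0);
        }
        _socket_mutex.unlock();
    }

    ~TxFullNotifier()
    {
        _socket_mutex.lock();
        if (_txfull_event_id) {
            mbed_event_queue()->cancel(_txfull_event_id);
        }
        _socket_mutex.unlock();
    }

private:
    rtos::Mutex _socket_mutex;
    bool _txfull_event = false;
    int _txfull_event_id = 0;
};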

Author


The atomic implementation is more or less a copy from ESP8266Interface. I'm not aware of any call_in side effects, but if the queue guarantees that a new event is always generated after call_in (also while executing the same event), then additional locking is not needed here. However, if you think a mutex is safer for some reason I can change the code.

Contributor


Discussed offline - short answer is that multiple call_ins are fine; it's that this could lose track of what it has queued, so the clean-up attempt in the destructor becomes ineffective.

(Plus the unprotected multithread access to the variable is formally undefined).

@AriParkkila
Author

@kjbracey-arm Please review, changed atomic to mutex and UDP to return NO_MEMORY instead of WOULDBLOCK.

@@ -100,7 +100,9 @@ void UDPSOCKET_ECHOTEST_BURST()
if (check_oversized_packets(sent, tx_buffers[x].len)) {
    TEST_IGNORE_MESSAGE("This device does not handle oversized packets");
}
TEST_ASSERT_EQUAL(tx_buffers[x].len, sent);
if (sent != NSAPI_ERROR_NO_MEMORY) {
Contributor


I think I'd be even more relaxed on this - I'd let it return any error, or the size sent.

But this is fine, as it's relaxing it somewhat.

(The code handling oversized packets seems wrong to me, but that's another issue).

@@ -132,6 +133,7 @@ class AT_CellularStack : public NetworkStack {
    bool tx_ready;              // socket is ready for sending on modem stack
    bool tls_socket;            // socket uses modem's internal TLS socket functionality
    nsapi_size_t pending_bytes; // The number of received bytes pending
    bool txfull_event;          // socket event after wouldblock
Contributor

@kjbracey Dec 20, 2019


You could save RAM by putting this bool next to connected.
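
An illustration of the RAM point; the layouts shown assume a typical 32-bit Cortex-M target and are not guaranteed by the standard:

// Hedged sketch: keeping the bools adjacent avoids the padding the compiler
// inserts before each 4-byte-aligned member.
#include <cstdint>

struct SocketScattered {      // bool, 4-byte member, bool
    bool connected;           // 1 byte + 3 bytes padding
    uint32_t pending_bytes;   // 4 bytes
    bool txfull_event;        // 1 byte + 3 bytes padding  -> sizeof == 12
};

struct SocketPacked {         // bools grouped together
    bool connected;           // 1 byte
    bool txfull_event;        // 1 byte + 2 bytes padding
    uint32_t pending_bytes;   // 4 bytes                   -> sizeof == 8
};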

@kjbracey
Contributor

CI started

@mbed-ci

mbed-ci commented Dec 23, 2019

Test run: SUCCESS

Summary: 11 of 11 test jobs passed
Build number: 1
Build artifacts

@0xc0170 removed the needs: CI label Dec 23, 2019
@0xc0170 added the release-version: 6.0.0-alpha-1 First pre-release version of 6.0.0 label Dec 23, 2019
@0xc0170 merged commit 1623b14 into ARMmbed:master Dec 23, 2019
Labels
release-version: 6.0.0-alpha-1 First pre-release version of 6.0.0
8 participants