Skip to content

[NETSOCKET] TCPSocket & UDPSocket event handler cannot be run in IRQ context #4343

Closed
@betzw

Description

@betzw
Contributor

Description

  • Type: Bug
  • Priority: Major

Bug

meed-os sha:
8f29517: Merge pull request #4317 from c1728p9/reduce_test_overhead
(i.e. current master)

Expected behavior
Sockets should not register callbacks (on state change events) which are not allowed to be executed in interrupt context.

Actual behavior
In method Socket::open() the virtual method Socket::event() gets registered as callback on state change of the socket. The API spec for NetworkStack::socket_attach() says:
The callback may be called in an interrupt context and should not perform expensive operations such as recv/send calls.
Unfortunately, both TCPSocket::event() and UDPSocket::event() call Semaphore::wait(0) which is not allowed to be called in interrupt context (following the CMSIS-RTOS specs for osSemaphoreWait()). This call consequently returns -1 rather than the number of available tokens causing (possibly erroneously) the release of the semaphore.

Activity

0xc0170

0xc0170 commented on May 19, 2017

@0xc0170
Contributor

cc @kjbracey-arm @geky

geky

geky commented on May 19, 2017

@geky
Contributor

Good catch! Yeah, calling Semaphore::wait is an error in that context. We should remove it to avoid having code that doesn't do what it appears to. Also relying on error codes is probably just a bad idea.

However, while the Semaphore::wait call is incorrect, it's not going to break the network stack. Both of the TCP and UDP event semaphores are handled in a while loop like this:
https://github.com/ARMmbed/mbed-os/blob/master/features/netsocket/TCPSocket.cpp#L50-L76

The extra semaphore releases may waste cycles, but it shouldn't break anything.

At the moment there is an ongoing effort to bring over cmsis-rtos 2 (#4294), which will introduce bounded semaphores. A bounded semaphore would provide the correct behaviour that this logic is trying to accomplish and should be easy enough to switch to once the cmsis-rtos 2 pr is in.

geky

geky commented on Jun 15, 2017

@geky
Contributor

Note: this should be resolved by #4560

betzw

betzw commented on Jun 19, 2017

@betzw
ContributorAuthor

Well #4560 has been closed and not merged, so maybe it is not solved yet. Pls. correct me if I am wrong!

On the other hand, this one has become a real issue on my end, as the while loops you have indicated are actually failing to handle timeouts (correctly). This is just because, in the worst case, the extra semaphore releases might be called every time a character is received (in my case it is an ATParser based Wi-Fi module driver using the serial line as communication interface) and such outperform over the number of calls to Semaphore::wait().

betzw

betzw commented on Jul 18, 2017

@betzw
ContributorAuthor

Any progress on this?

cc @MarceloSalazar @screamerbg
This is currently a blocker for the SPWF01SA wifi driver!

geky

geky commented on Jul 18, 2017

@geky
Contributor

Sorry, the pr moved around a bit. I believe it's over here now: #4580

Although I think the wait call has already been removed from the socket class:
https://github.com/ARMmbed/mbed-os/blob/master/features/netsocket/TCPSocket.cpp#L191-L192

Are you running into an issue with the current master?

betzw

betzw commented on Jul 19, 2017

@betzw
ContributorAuthor

I am running into issues using mbed-os-5.5.x.
I have never tried the current master as I am in the middle of development and wanted a stable mbed-os beyond.

Anyway, I am afraid that https://github.com/ARMmbed/mbed-os/blob/master/features/netsocket/TCPSocket.cpp#L191-L192 will not resolve my issue as it does exactly what the previous version actually did, i.e. incrementing indistinctly the semaphores which in while loops like this might lead to timeouts being completely missed sunch ending up as endless loops.
This is because the stack operations (e.g. _stack->socket_connect() or _stack->socket_recv(), ecc.) called therein might cause the TCPSocket callback TCPSocket::event() being called at least one if not several times even if these operations might return an error. As a consequence count = _write_sem.wait(_timeout) will always lead to (count < 1) being false and therefor we will not break out and return from the loop ⚡

betzw

betzw commented on Jul 19, 2017

@betzw
ContributorAuthor
geky

geky commented on Jul 20, 2017

@geky
Contributor

Ah yes, this should be fixed by #4580

betzw

betzw commented on Jul 21, 2017

@betzw
ContributorAuthor

Sounds good, once I have the time and/or all related PRs are merged I will give it a shot and let you know!

betzw

betzw commented on Sep 6, 2017

@betzw
ContributorAuthor

Hi again @geky,
Right in this moment I am trying out PR #4580 which in the meantime has been merged and am learning that not even this one is resolving my issue 🤕. This is again because the stack operations (e.g. _stack->socket_connect() or _stack->socket_recv(), ecc.) called therein might cause the TCPSocket callback TCPSocket::event() being called at least one if not several times even if these operations might return an error. The reason for this is, that in my implementation those operations require bi-directional communication with the WiFi module to be executed even if there is no data available in the module.
As a consequence _event_flag will always be set and therefor we will not break out and return from the loop ⚡
Thinking about it now, I am actually wondering if something like this can be solved at the NETSOCKET level at all or if it's not my duty to find a solution for this?
Your opinion on this is more than welcome!

geky

geky commented on Sep 6, 2017

@geky
Contributor

It should be fine for TCPSocket::event to get called extra times. We intentional support spurious events in the socket APIs.

This condition exits the function if socket_recv returned any code other than NSAPI_ERROR_WOULD_BLOCK. Following the code path in socket_recv, the _event_flag is only checked if socket_recv returns NSAPI_ERROR_WOULD_BLOCK.

So as long as socket_recv returns a negative error code or a positive count of bytes sent, the function should exit correctly. The only downside of _event_flag always being set is that the loop checking socket_recv will never sleep. Could it be you are always returning NSAPI_ERROR_WOULD_BLOCK?

Are multiple threads involved? It could be the socket_recv loop is starving your driver's thread. What priorities are your threads running at?

kjbracey

kjbracey commented on Sep 7, 2017

@kjbracey
Contributor

There is an issue in that in the event of a spurious wakeup, timeout is not adjusted, so we restart the wait from the original timeout. Which means a block that should time out can go on longer than it's supposed to or forever.

Is that the problem you're seeing?

I believe the code is fine as long as timeout is 0 or infinite though.

betzw

betzw commented on Sep 7, 2017

@betzw
ContributorAuthor

Sorry for having said that a bit badly.
In my situation function socket_recv() always returns NSAPI_ERROR_WOULD_BLOCK and during its call the _event_flag will always be set .
Not sure what happens for immediate timeout, but most likely the wait will return with the information that the flag was set and therefore we will retry the execution of socket_recv() as it does for all other timeout cases!
So an infinite loop seems to be guaranteed in case the (real) data never arrives ...

kjbracey

kjbracey commented on Sep 7, 2017

@kjbracey
Contributor

Ah, so the mere act of making a socket call to the driver always causes the driver to trigger another event.

Yes, I don't see how that's going to fly. Your driver is going to have to be more selective about how it triggers events.

Netsocket can't itself suppress an event coming during a socket call - it may be that it was a valid, and not-to-be-repeated wakeup.

Can you arrange some sort of suppression so serial IRQs don't trigger the event while you're executing serial comms? You would need to take care at the end to transition from "parsing, IRQs disabled" to "nothing more to parse, IRQs enabled for next event" atomically and reliably.

Alternatively, you could restructure to a more detailed implementation of the event - only raise events from the OOB handler that parses the notification from the modem, and only for that specific socket.

That means you need your own background parsing thread that monitors the port - it's that which would be woken by the interrupt to run the parser, which then would send the events.

This is quite an elegant implementation of that approach (using AtCmdParser, but same general technique would work with AtParser):

https://github.com/RobMeades/mbed-os/blob/rob_fiddle/features/cellular/TARGET_UBLOX_MODEM_GENERIC_AT_DATA/ublox_modem_driver/UbloxCellularDriverGenAtData.cpp

betzw

betzw commented on Sep 7, 2017

@betzw
ContributorAuthor

Ah, so the mere act of making a socket call to the driver always causes the driver to trigger another event.

Exactly, as in the reference implementaion for esp8266 the attached callback functions get simply forwarded to the Serial object, which calls the callback everytime the RX ISR exists (and there was data).

That means you need your own background parsing thread that monitors the port - it's that which would be woken by the interrupt to run the parser, which then would send the events.

Well, this would make the driver dependent on the RTOS being present (MBED_CONF_RTOS_PRESENT) something I would like to avoid.

kjbracey

kjbracey commented on Sep 7, 2017

@kjbracey
Contributor

In which case, I would suggest you disable the interrupt by unattaching the callback from BufferedSerial before you start your parser operation, and reattach after you've finished.

Looking at the BufferedSerial implementation, I think you'd need to include a manual if readable(), call my handler directly afterwards, to avoid IRQs being lost between you giving up parsing and re-enabling the IRQs.

betzw

betzw commented on Sep 7, 2017

@betzw
ContributorAuthor

Currently, I am doing something very similar.
I you are interested you can just take a look at the driver for the X-NUCLEO-IDW01M1 expansion board.

kjbracey

kjbracey commented on Sep 7, 2017

@kjbracey
Contributor

I see you sending extra commands to the modem to tell it to stop producing indication messages, but it's the generation of the receive interrupts while you send commands that's the problem.

betzw

betzw commented on Sep 7, 2017

@betzw
ContributorAuthor

If you dig a bit further you will also find methods _block_event_callback(), _unblock_event_callback(), and the like, which do the crucial things regarding this issue.

kjbracey

kjbracey commented on Sep 7, 2017

@kjbracey
Contributor

Thanks, I see it now. The intent seems to be that no events are signalled because you talk to the modem. And I see no obvious flaws. But apparently it's failing, and you are generating events?

betzw

betzw commented on Sep 7, 2017

@betzw
ContributorAuthor

Two months ago when I was discussing this issue with @geky I still had the impression that this issue should and could be resolved at NETSOCKET level. So, in order to be able to proceed with the implementation of the driver for the X-NUCLEO-IDW01M1 expansion board, I started to implemented what at the time I thought would be just a workaround until related PRs (including #4580) would have been integrated into mbed-os. It was just yesterday that I started to understand that this actually is not just a workaround but necessary for having the driver working correctly together with NETSOCKETs.
This (former workaround) version seems to work correctly and is not failing as you might have understood ... Sorry for the misunderstanding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @geky@0xc0170@betzw@kjbracey

        Issue actions

          [NETSOCKET] TCPSocket & UDPSocket event handler cannot be run in IRQ context · Issue #4343 · ARMmbed/mbed-os