Description
Description
- Type: Bug
- Priority: Major
Bug
meed-os sha:
8f29517: Merge pull request #4317 from c1728p9/reduce_test_overhead
(i.e. current master
)
Expected behavior
Sockets should not register callbacks (on state change events) which are not allowed to be executed in interrupt context.
Actual behavior
In method Socket::open()
the virtual method Socket::event()
gets registered as callback on state change of the socket. The API spec for NetworkStack::socket_attach()
says:
The callback may be called in an interrupt context and should not perform expensive operations such as recv/send calls.
Unfortunately, both TCPSocket::event()
and UDPSocket::event()
call Semaphore::wait(0)
which is not allowed to be called in interrupt context (following the CMSIS-RTOS specs for osSemaphoreWait()
). This call consequently returns -1
rather than the number of available tokens causing (possibly erroneously) the release of the semaphore.
Activity
0xc0170 commentedon May 19, 2017
cc @kjbracey-arm @geky
geky commentedon May 19, 2017
Good catch! Yeah, calling
Semaphore::wait
is an error in that context. We should remove it to avoid having code that doesn't do what it appears to. Also relying on error codes is probably just a bad idea.However, while the
Semaphore::wait
call is incorrect, it's not going to break the network stack. Both of the TCP and UDP event semaphores are handled in a while loop like this:https://github.com/ARMmbed/mbed-os/blob/master/features/netsocket/TCPSocket.cpp#L50-L76
The extra semaphore releases may waste cycles, but it shouldn't break anything.
At the moment there is an ongoing effort to bring over cmsis-rtos 2 (#4294), which will introduce bounded semaphores. A bounded semaphore would provide the correct behaviour that this logic is trying to accomplish and should be easy enough to switch to once the cmsis-rtos 2 pr is in.
geky commentedon Jun 15, 2017
Note: this should be resolved by #4560
betzw commentedon Jun 19, 2017
Well #4560 has been closed and not merged, so maybe it is not solved yet. Pls. correct me if I am wrong!
On the other hand, this one has become a real issue on my end, as the while loops you have indicated are actually failing to handle timeouts (correctly). This is just because, in the worst case, the extra semaphore releases might be called every time a character is received (in my case it is an
ATParser
based Wi-Fi module driver using the serial line as communication interface) and such outperform over the number of calls toSemaphore::wait()
.betzw commentedon Jul 18, 2017
Any progress on this?
cc @MarceloSalazar @screamerbg
This is currently a blocker for the SPWF01SA wifi driver!
geky commentedon Jul 18, 2017
Sorry, the pr moved around a bit. I believe it's over here now: #4580
Although I think the wait call has already been removed from the socket class:
https://github.com/ARMmbed/mbed-os/blob/master/features/netsocket/TCPSocket.cpp#L191-L192
Are you running into an issue with the current master?
betzw commentedon Jul 19, 2017
I am running into issues using
mbed-os-5.5.x
.I have never tried the current
master
as I am in the middle of development and wanted a stable mbed-os beyond.Anyway, I am afraid that https://github.com/ARMmbed/mbed-os/blob/master/features/netsocket/TCPSocket.cpp#L191-L192 will not resolve my issue as it does exactly what the previous version actually did, i.e. incrementing indistinctly the semaphores which in while loops like this might lead to timeouts being completely missed sunch ending up as endless loops.
This is because the stack operations (e.g.
_stack->socket_connect()
or_stack->socket_recv()
, ecc.) called therein might cause theTCPSocket
callbackTCPSocket::event()
being called at least one if not several times even if these operations might return an error. As a consequencecount = _write_sem.wait(_timeout)
will always lead to(count < 1)
beingfalse
and therefor we will not break out and return from the loop ⚡betzw commentedon Jul 19, 2017
@nikapov-ST
geky commentedon Jul 20, 2017
Ah yes, this should be fixed by #4580
betzw commentedon Jul 21, 2017
Sounds good, once I have the time and/or all related PRs are merged I will give it a shot and let you know!
betzw commentedon Sep 6, 2017
Hi again @geky,
Right in this moment I am trying out PR #4580 which in the meantime has been merged and am learning that not even this one is resolving my issue 🤕. This is again because the stack operations (e.g.
_stack->socket_connect()
or_stack->socket_recv()
, ecc.) called therein might cause theTCPSocket
callbackTCPSocket::event()
being called at least one if not several times even if these operations might return an error. The reason for this is, that in my implementation those operations require bi-directional communication with the WiFi module to be executed even if there is no data available in the module.As a consequence
_event_flag
will always be set and therefor we will not break out and return from the loop ⚡Thinking about it now, I am actually wondering if something like this can be solved at the
NETSOCKET
level at all or if it's not my duty to find a solution for this?Your opinion on this is more than welcome!
geky commentedon Sep 6, 2017
It should be fine for TCPSocket::event to get called extra times. We intentional support spurious events in the socket APIs.
This condition exits the function if socket_recv returned any code other than NSAPI_ERROR_WOULD_BLOCK. Following the code path in socket_recv, the _event_flag is only checked if socket_recv returns NSAPI_ERROR_WOULD_BLOCK.
So as long as socket_recv returns a negative error code or a positive count of bytes sent, the function should exit correctly. The only downside of _event_flag always being set is that the loop checking socket_recv will never sleep. Could it be you are always returning NSAPI_ERROR_WOULD_BLOCK?
Are multiple threads involved? It could be the socket_recv loop is starving your driver's thread. What priorities are your threads running at?
kjbracey commentedon Sep 7, 2017
There is an issue in that in the event of a spurious wakeup, timeout is not adjusted, so we restart the wait from the original timeout. Which means a block that should time out can go on longer than it's supposed to or forever.
Is that the problem you're seeing?
I believe the code is fine as long as timeout is 0 or infinite though.
betzw commentedon Sep 7, 2017
Sorry for having said that a bit badly.
In my situation function
socket_recv()
always returnsNSAPI_ERROR_WOULD_BLOCK
and during its call the_event_flag
will always be set .Not sure what happens for immediate timeout, but most likely the wait will return with the information that the flag was set and therefore we will retry the execution of
socket_recv()
as it does for all other timeout cases!So an infinite loop seems to be guaranteed in case the (real) data never arrives ...
kjbracey commentedon Sep 7, 2017
Ah, so the mere act of making a socket call to the driver always causes the driver to trigger another event.
Yes, I don't see how that's going to fly. Your driver is going to have to be more selective about how it triggers events.
Netsocket can't itself suppress an event coming during a socket call - it may be that it was a valid, and not-to-be-repeated wakeup.
Can you arrange some sort of suppression so serial IRQs don't trigger the event while you're executing serial comms? You would need to take care at the end to transition from "parsing, IRQs disabled" to "nothing more to parse, IRQs enabled for next event" atomically and reliably.
Alternatively, you could restructure to a more detailed implementation of the event - only raise events from the OOB handler that parses the notification from the modem, and only for that specific socket.
That means you need your own background parsing thread that monitors the port - it's that which would be woken by the interrupt to run the parser, which then would send the events.
This is quite an elegant implementation of that approach (using AtCmdParser, but same general technique would work with AtParser):
https://github.com/RobMeades/mbed-os/blob/rob_fiddle/features/cellular/TARGET_UBLOX_MODEM_GENERIC_AT_DATA/ublox_modem_driver/UbloxCellularDriverGenAtData.cpp
betzw commentedon Sep 7, 2017
Exactly, as in the reference implementaion for
esp8266
the attached callback functions get simply forwarded to theSerial
object, which calls the callback everytime the RX ISR exists (and there was data).Well, this would make the driver dependent on the RTOS being present (
MBED_CONF_RTOS_PRESENT
) something I would like to avoid.kjbracey commentedon Sep 7, 2017
In which case, I would suggest you disable the interrupt by unattaching the callback from BufferedSerial before you start your parser operation, and reattach after you've finished.
Looking at the BufferedSerial implementation, I think you'd need to include a manual
if readable(), call my handler directly
afterwards, to avoid IRQs being lost between you giving up parsing and re-enabling the IRQs.betzw commentedon Sep 7, 2017
Currently, I am doing something very similar.
I you are interested you can just take a look at the driver for the X-NUCLEO-IDW01M1 expansion board.
kjbracey commentedon Sep 7, 2017
I see you sending extra commands to the modem to tell it to stop producing indication messages, but it's the generation of the receive interrupts while you send commands that's the problem.
betzw commentedon Sep 7, 2017
If you dig a bit further you will also find methods
_block_event_callback()
,_unblock_event_callback()
, and the like, which do the crucial things regarding this issue.kjbracey commentedon Sep 7, 2017
Thanks, I see it now. The intent seems to be that no events are signalled because you talk to the modem. And I see no obvious flaws. But apparently it's failing, and you are generating events?
betzw commentedon Sep 7, 2017
Two months ago when I was discussing this issue with @geky I still had the impression that this issue should and could be resolved at NETSOCKET level. So, in order to be able to proceed with the implementation of the driver for the X-NUCLEO-IDW01M1 expansion board, I started to implemented what at the time I thought would be just a workaround until related PRs (including #4580) would have been integrated into mbed-os. It was just yesterday that I started to understand that this actually is not just a workaround but necessary for having the driver working correctly together with NETSOCKETs.
This (former workaround) version seems to work correctly and is not failing as you might have understood ... Sorry for the misunderstanding.