EventQueue timing errors when using tickless mode #13801
Comments
@marcemmers thank you for raising this issue. Please take a look at the following comments: It would help if you could also specify the versions of any tools you are using. NOTE: If there are fields which are not applicable then please just add 'n/a' or 'None'. This indicates to us that at least all the fields have been considered.
Thank you for raising this detailed GitHub issue. I am now notifying our internal issue triagers.
We have updated to OS version 6.4.0 and I can confirm that the issue is still present. It would be nice to get some feedback on this because this is kind of a showstopper for us.
@ARMmbed/mbed-os-core
Is the problem only present when debugging?
Didn't they actually fix it by disabling the sleep manager for the period they needed to process something fast, then re-enabling it?
Which version does it work with, only 5.12.0 (so we should compare 5.15.0 vs 6.1.0)? What value does
The issue is indeed only present when debugging or halting the CPU in any way. This stems from the RTC continuing to run in the background while the SysTick IRQ is not being handled.
Looks to me like maiorfi responded to hudakz that the issue was fixed. Hudakz was talking about the tickless option.
I think the issue originates from the changes made to equeue_mbed.cpp in this PR: #10104. There is a change going from a single clock source to multiple clock sources depending on the ISR context. This means 5.13.0 should work and 5.14.0 would have the issue.
MBED_CONF_EVENTS_USE_LOWPOWER_TIMER_TICKER is 0, but that does not have any effect because MBED_CONF_RTOS_API_PRESENT is 1, so MBED_CONF_EVENTS_USE_LOWPOWER_TIMER_TICKER is not even checked.

I can reproduce the issue with a small bit of code which also shows the timing source issue I was talking about. I use the SEGGER RTT functionality so there is no influence on the timing from a serial port. For the record, this was tested on an NRF52840_DK with shared_dispatch_from_application enabled. main.cpp:
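The original snippet is not preserved here, but a minimal sketch of the kind of test described would look roughly like this, assuming a low-power ticker posting to the shared queue and SEGGER RTT for output (the names such as `on_tick` and the exact structure are assumptions, not the original code):

```cpp
// Sketch only: follows the pattern described above, not the exact original test.
#include "mbed.h"
#include "mbed_events.h"
#include "SEGGER_RTT.h"   // assumes the SEGGER RTT sources were added to the project

using namespace std::chrono;

static events::EventQueue *queue;
static LowPowerTicker ticker;

static void handler()
{
    // Runs in thread context when the shared queue dispatches the event.
    unsigned kernel_ms = (unsigned)duration_cast<milliseconds>(
                             rtos::Kernel::Clock::now().time_since_epoch()).count();
    SEGGER_RTT_printf(0, "event handled, kernel time %u ms\n", kernel_ms);
}

static void on_tick()
{
    // ISR context: posting here makes equeue_tick() run in interrupt context,
    // so the event's timestamp comes from the "raw" time source.
    queue->call(handler);
}

int main()
{
    // Relies on events.shared-dispatch-from-application being enabled.
    queue = mbed_event_queue();
    ticker.attach(&on_tick, 1s);
    queue->dispatch_forever();
}
```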
I also added a print in the equeue_tick function:
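Roughly, such instrumentation would look like the sketch below. This is a simplified illustration of the tickless path, not the verbatim mbed-os source; the two `read_*` helpers are placeholders for the real time-source calls.

```cpp
// Simplified sketch of equeue_tick() in the tickless configuration, with a
// debug print added; the two helpers stand in for the real time-source calls.
unsigned equeue_tick(void)
{
    unsigned t;
    if (core_util_is_isr_active()) {
        // ISR path: read the "raw" lpticker-backed time directly, because the
        // RTOS tick count can still be stale right after waking from sleep.
        t = read_isr_safe_time_ms();      // placeholder
    } else {
        // Thread path: use the RTOS kernel tick count.
        t = read_kernel_tick_count_ms();  // placeholder
    }
    SEGGER_RTT_printf(0, "equeue_tick: isr=%d t=%u\n",
                      (int)core_util_is_isr_active(), t);
    return t;
}
```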
This is the result when the execution was halted for a while at the start:
This is the result when a hardware reset was performed without the debugger attached:
Even when running into a breakpoint the timing starts to go off. This was tested with the following steps:
@kjbracey-arm Can you review?
Some thoughts. The time APIs being looked at here are supposed to all be coming from the same ultimate source (HAL lpticker in this config, so RTC), and they should be kept in sync.

This is a tickless RTOS build, right, not a non-RTOS build? Does this happen while stopping with the debugger while running (e.g. a tight spin loop), or sleeping, or both? That changes what's going on - while running we are ticking, while sleeping we have separate calculations.

While running, I think overrun should be dealt with by triggering multiple ticks after a hold-up. Every accumulated 1000us read out from the lpticker clock should deliver a tick to RTX - getting a back-to-back stream of them if necessary. In the event that the overrun is occurring due to debugging, that intended tick burst might be unhelpful. I can't remember if I deliberately limited how many it would deliver, but if it is limited, then it should be tracking the ongoing

Aside from the problem here - you might want to change some config registers in the device to stop the clock source when stopping the CPU. It's probably ultimately less confusing overall to freeze everything. Maybe not viable if the source is the RTC.
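A minimal sketch of that catch-up idea, assuming one RTX tick per 1000 us and using a placeholder for the actual tick delivery (this is an illustration of the concept, not the real SysTimer internals):

```cpp
#include <cstdint>

void deliver_one_tick_to_rtx();   // placeholder, not a real RTX API

// Illustration only: every accumulated millisecond read from the lpticker time
// base must eventually reach RTX as one tick, so after a hold-up the ticks are
// delivered back-to-back until the kernel clock has caught up with real time.
void deliver_pending_ticks(uint64_t now_us)
{
    static uint64_t acknowledged_us = 0;
    while (now_us - acknowledged_us >= 1000) {
        acknowledged_us += 1000;
        deliver_one_tick_to_rtx();
    }
}
```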
Looking more closely, it looks like it should be generating catch-up bursts. There are two layers here - firstly

And then those interrupts in

I can see there's a danger of time being non-monotonic, due to the two readouts, but in principle it could be okay. During the "wake up from stop" burst, there should be back-to-back interrupts keeping us using the top "in ISR" path giving "raw" numbers. Once we've caught up, then it can use the RTX numbers.

One of the reasons for this messing is that without it, what happens is that when woken up from tickless sleep,

Still seems like there's something wrong in the timer system, but maybe some rework to equeue could stop it asking about time in interrupt context?
This is indeed a normal RTOS build just with tickless enabled for low power.
When stopping while the device is in sleep the issue is not there. I suspect this is because in that case I will be halted before the exit of hal_sleep(), so this is before the kernel is restarted. As far as I've seen there will be no burst of kernel tick increments after a stop. Instead, the kernel is suspended before going to sleep and resumed after sleep; the time slept is being calculated in

Interrupts will be handled before the kernel is resumed and thus the ticks are not updated yet (see the sketch below). So I understand why there is a different time source in interrupts, but that is also kind of the source of the problem.
I've looked into this and this is indeed not an option with the NRF52 series.
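For reference, a rough sketch of one tickless idle iteration as described above, modelled on the CMSIS-RTOS2 osKernelSuspend()/osKernelResume() calls; the sleep and time-reading helpers are placeholders, and the real mbed-os idle hook is more involved:

```cpp
#include <stdint.h>
#include "cmsis_os2.h"                         // osKernelSuspend / osKernelResume

uint64_t read_lp_time_us(void);                // placeholder: lpticker/RTC time base
void sleep_until_interrupt(uint32_t ticks);    // placeholder: hal_sleep()/deep sleep

// Rough sketch of one tickless idle iteration. Pending interrupt handlers run
// after wake-up but *before* osKernelResume(), which is why equeue_tick() has
// to use a non-RTOS time source when called from an ISR at that point.
void tickless_idle_iteration(void)
{
    uint32_t ticks_allowed = osKernelSuspend();   // kernel tick count frozen here
    uint64_t before_us = read_lp_time_us();

    sleep_until_interrupt(ticks_allowed);
    // <-- IRQ handlers (including ones posting to an EventQueue) execute here

    uint64_t slept_ms = (read_lp_time_us() - before_us) / 1000;
    osKernelResume((uint32_t)slept_ms);           // only now is the kernel clock advanced
}
```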
I'd expect the catch-up burst to occur if stopped in the running state, not sleeping. While running the timing system needs to deliver a tick call every 1ms to RTX - there's no other way to feed it time progress, so it would have to send a burst to keep up the uniform time base. If stopped while sleeping, I'd expect that "how long were we asleep" code for tickless should just measure it and do its job.
Ok, that makes sense. So indeed the issues are occurring in the running state. If we are halted in running state

Even if it would be running, _unacknowledged_ticks being a uint8_t would cause overflows, and that can be easily fixed by using a bigger type.
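Illustrative arithmetic for that overflow, assuming one tick per millisecond and a made-up 5 s halt:

```cpp
#include <cstdint>
#include <cstdio>

// Why a uint8_t backlog counter silently drops time after a long halt.
int main()
{
    uint32_t halted_ms = 5000;   // e.g. 5 s stopped on a breakpoint
    uint8_t backlog = 0;         // same width as _unacknowledged_ticks today
    for (uint32_t i = 0; i < halted_ms; i++) {
        backlog++;               // wraps every 256 ticks
    }
    // Only halted_ms % 256 ticks survive; the rest (a multiple of 256 ms)
    // are dropped, leaving the kernel clock permanently behind real time.
    printf("delivered %u of %u ticks, lost %u ms\n",
           (unsigned)backlog, (unsigned)halted_ms, (unsigned)(halted_ms - backlog));
    return 0;
}
```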
When you start running after a debugger freeze, there would be a pending timer IRQ so it would call

All of this was designed only for occasional overruns in normal operation, and the

System could be optimised to handle this by noting the large chunk of missing time and just doing a bulk increment to unacknowledged ticks, but not sure if that might complicate the normal fast path.
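A sketch of that bulk-increment idea (not implemented in mbed-os; apart from _unacknowledged_ticks the names here are invented for illustration):

```cpp
#include <cstdint>

// Sketch of the bulk-increment idea: instead of one catch-up pass per missed
// millisecond, measure the whole gap once and add it to the backlog in one step.
struct TickCatchUp {
    uint32_t _unacknowledged_ticks = 0;   // would need to be wide enough for big gaps
    uint64_t _last_tick_us = 0;           // invented name, for illustration only

    void on_resume(uint64_t now_us)
    {
        uint64_t missed = (now_us - _last_tick_us) / 1000;   // whole missed 1 ms ticks
        _unacknowledged_ticks += (uint32_t)missed;
        _last_tick_us += missed * 1000;
        // The backlog would then be drained into RTX, ideally without a full
        // round-robin / context-switch check for every single tick.
    }
};
```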
I have a vague recollection that part of the intent might have been that I didn't want to be calling catch-up things thousands of times after a big debugger-type freeze, so making that count uint8_t would just quietly wrap and we'd throw some time away. But apparently that wrap leaves everything out-of-sync. EDIT: comment in
Switching to uint32_t indeed fixes the issue, so the tick handling works as intended. As you were saying, the only issue with this is the number of operations required to process all the unacknowledged ticks after several seconds of halted state. Unfortunately there is no API available in the RTX layer to increment the ticks by a set amount without suspending and resuming the kernel or calling the RTX tick handler. The latter also performs the round-robin routine, which is not necessary because there are more ticks to follow.
The coyness about not having a big tick backlog to RTX seems futile in this situation, because we are still doing an awful lot of

Maybe to speed that up there could be a variant of

The aim here is to try to have the time base run smoothly against real time - I was thinking more of system suspend type scenarios, and current state-of-the-art appears to be that monotonic timers do track time spent in system suspend, so there can sometimes be catch-up on wakeup. But most of that case in RTX would be dealt with via the osKernelResume parameter.

Whether that's the best thing to do for debugging freezes is less clear. We could choose to drop ticks beyond some threshold. But for now I'd be inclined to keep it simple, and just raise
I agree with this in that this is the easiest solution for the issue we have now. Since this only happens when one is debugging and halting the system, I suppose the impact of a large backlog should be smaller. If you can halt the system for a certain amount of time then there should be no issue in processing the backlog after continuing execution. It would be nice to add a method to prevent all the cyclic context switches when processing the backlog to improve performance, but that's really a nice-to-have. I will try testing the 32-bit solution with the normal software and see where that brings us.
The same problem has been seen without a debugger in #13772. Pretty nasty system starvation from some flash write, but they otherwise get away with it because it's just a one-off start-up event. A 32-bit counter would fix that too.
I did search for similar issues but I guess I missed that one. In my tests there is the same multiple of 256 ms plus 1000 for the second sleep time. Do you want me to make a PR or will you do it yourself?
I was mentioned a few times in that one but missed it. Seems @mikaleppanen was ahead of us. Could you make a PR for 32-bit? (And some intelligent adjustment to that comment about 8-bit - add a link to this issue.)
Good findings here!
Description of defect
We have recently been experiencing issues with the timing of the EventQueue since switching to mbed 6.1.0 from 5.12.0. After some testing we have found out that the issue originates from the equeue_tick function in equeue_mbed.cpp. This gets the ticks from a different source if it is called from an interrupt context (but only in tickless mode).

When debugging we halt the execution when it lands on main or on a breakpoint. While the core is halted the kernel clock is no longer updated but the RTC peripheral is still running. When we start or continue the program the RTC clock is now ahead of the kernel clock. If an event is posted to the queue in interrupt context, the tick count is also read in interrupt context from the RTC clock, so the wakeup time for that event is ahead of the kernel time which is used to determine the next wakeup event. This leads to a delay in event handling; the delay is proportional to the time spent halted.
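A worked example of the resulting skew, with made-up numbers (a 10 s halt and a 100 ms event delay):

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative arithmetic only; the numbers are invented for the example.
int main()
{
    uint32_t kernel_ms = 50000;              // kernel clock, frozen while halted
    uint32_t rtc_ms    = kernel_ms + 10000;  // RTC-backed clock kept running for 10 s

    // An event posted from interrupt context 100 ms after resuming gets its
    // timestamp from the RTC-backed source:
    uint32_t event_due = rtc_ms + 100;

    // The dispatcher compares that deadline against the kernel clock, so the
    // event appears to be due ~10.1 s from now instead of 100 ms from now.
    printf("apparent delay: %u ms (expected 100 ms)\n", (unsigned)(event_due - kernel_ms));
    return 0;
}
```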
I've tried searching for an easy solution but I couldn't find one. In my opinion there should be only one timebase used for the EventQueue ticks, but that's easier said than done. The kernel clock gets updated after the interrupts have been handled and it's not that easy to update it while still in the critical sleep section.
There is a thread on the forum which looks to be the same issue: https://forums.mbed.com/t/eventqueue-call-target-method-triggers-much-later-than-expected/10163. It was "solved" by disabling tickless mode, but that is not an option for us.
Target(s) affected by this defect?
NRF52840 and probably NUCLEO_L476RG
Toolchain(s) (name and version) displaying this defect?
IAR 8.32.3
What version of Mbed-os are you using (tag or sha)?
mbed-os-6.1.0
What version(s) of tools are you using? List all that apply (e.g. mbed-cli)
n/a
How is this defect reproduced ?
The code from the forum should reproduce this issue. I don't know whether it is necessary to halt execution for a while to reproduce it.