Custom target silently fails to start main app after adding mbed_trace_init(); to bootloader
#11205
Comments
In the faulting condition, it seems that it diverges from 72 because the contents of Moving on, line 93 is where the exception is caused. This is the line:
The values involved here are as follows:
I don't really understand the *n.b. I got this value in gdb using |
@AGlass0fMilk @40Grit @0xc0170 Tagging you guys now that I've got something worth showing and the issue thread has moved. |
The bootloader->main image transition basically tries to "reset" hardware things back to a safe state, then effectively runs the main image as if through the reset vector, so everything gets reinitialised.
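To make the "run the main image as if through the reset vector" step concrete, here is a minimal host-side sketch. It is not the Mbed OS implementation: it only parses the first two words of a Cortex-M vector table (initial stack pointer, then the Reset_Handler address) from a little-endian image buffer. The names image_entry_t, read_le32 and read_image_entry are hypothetical, and the sanity checks are an assumption about what a careful bootloader might verify before jumping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the two values a bootloader needs from an image. */
typedef struct {
    uint32_t initial_sp;     /* word 0 of the vector table */
    uint32_t reset_handler;  /* word 1 of the vector table */
} image_entry_t;

/* Read one little-endian 32-bit word from a byte buffer. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Parse the start of a Cortex-M vector table.
 * Returns 0 on success, -1 if the buffer is too short or the values look
 * implausible (stack pointer not word aligned, or a reset handler address
 * without the Thumb bit set).
 */
int read_image_entry(const uint8_t *image, size_t len, image_entry_t *out)
{
    assert(image != NULL && out != NULL);
    if (len < 8) {
        return -1;
    }
    out->initial_sp = read_le32(image);
    out->reset_handler = read_le32(image + 4);
    if ((out->initial_sp & 0x3u) != 0) {
        return -1;
    }
    if ((out->reset_handler & 0x1u) == 0) {
        return -1;
    }
    return 0;
}
```

On the target, the bootloader would then quiesce interrupts, point VTOR at the image, load MSP with initial_sp and branch to reset_handler. Those steps need core intrinsics and are deliberately left out of this sketch.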
All software state is lost, so it's not about closing - we effectively do a software "reset" in But if the hardware is not correctly reset by either the bootloader's shutdown or the main image's start-up code, then the new image could be confused by hardware not being in the expected state. In this case though, it seems like maybe it's just a failure to initialize RAM correctly? At the point that the first SVC_Handler is called, RTX should not have been initialised yet, meaning all its static I would expect that structure to have been placed in the Check to see why that apparently hasn't happened, or has gone wrong. Does the linker map show it between Maybe stick a watchpoint on the location of
It's expecting R1 to be a pointer to a |
Couldn't find a var named "osThreadInfo" so I'm assuming you meant "osRtxInfo". Here's the watchpoint trace of this var from start to fail:
That last line where the value of
Not sure why having mbed_trace present is preventing this from being assigned to |
What I'm looking for is where From the dump, it looks like there's been no attempt to actually initialise that structure at the point you've stopped in the SVC Handler. The initialisation code should be copying the initialised data from |
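For readers unfamiliar with the RAM initialisation being discussed here: in a typical embedded C start-up sequence, initialised data is copied from its load address in flash into RAM, and zero-initialised data is cleared, before any C code that relies on statics runs. The sketch below simulates that on a host with plain buffers. The function name init_ram and its parameters are hypothetical; on a real target the addresses and sizes come from linker-script symbols whose names vary between toolchains and targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for start-up RAM initialisation.
 * data_load is the flash copy of the initialised data, data_start is where
 * that data lives in RAM at run time, and bss_start is the zero-initialised
 * region. All names here are illustrative only.
 */
void init_ram(uint8_t *data_start, const uint8_t *data_load, size_t data_size,
              uint8_t *bss_start, size_t bss_size)
{
    assert(data_start != NULL && data_load != NULL && bss_start != NULL);
    memcpy(data_start, data_load, data_size); /* initialised data: flash -> RAM */
    memset(bss_start, 0, bss_size);           /* zero-initialised data */
}
```

If a watchpoint on a static structure never fires during this phase, the structure either was not placed in one of these regions or the start-up code did not run for the image you think it did.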
To double-check - the crash is happening in the main image, right? As it tries to initialise its OS? Also, I'd like to see the lines from |
Are all those watchpoints from the main image execution - so the It looks like it is being set to zero, but I don't see where that So mismatch between loaded image and ELF in the debugger? Is this a custom bootloader you've created that has the RTOS in use? Our own bootloaders disable the RTOS, so you may be hitting some sort of problem with interrupts not being properly shut down that we haven't? The fact that Are you sure you've got the right ELF file in the debugger? |
To be really clear here, I was previously using the (
*also note that this map file was generated from mbed-cli and I found it in the build folder of my main app project. The chunk of the |
To verify -- indeed the crash occurs within the main image as it tries to initialise the OS (I think more specifically it's failing to initialise the Kernel)
So this region in the main app is in my comment above. That same region in the
|
I have both the bootloader and main application ELF files loaded into GDB. I've set a breakpoint at
So it seems the majority of the watch points are hit before entering the main app.
So perhaps there's an offset here causing this 0x10 that was actually meant for the field
See my bottom response in this comment.
I've had a few people ask me about the bootloader. It's a managed mbed bootloader where the contents of
SVC_Handler is used for software interrupts right? Is there a way we can trace back to see who's calling these interrupts before
So I was making the mistake of assuming that when I typed @kjbracey-arm What information would you like me to redo now that I'm properly flashing the binaries and loading the elfs to match? |
Interesting update
Seeing above that there was some offset business I thought - at a high level here - perhaps this offset is coming from trace existing in the bootloader but not existing in the main application? So I initialised trace in my blinky (main) program as well and reflashed the image. Sure enough, the bootloader successfully boots to the main app now and both are able to use trace messages.
I'm not sure what this tells us exactly, but I think this means for now I can have a working bootloader so long as the peripherals I use in it are also present in the main application. Surely this isn't intended behaviour?
EDIT: Something else that sucks is this problem exists backwards too. Meaning, if I have trace defined in my main app, but not in my bootloader, then it faults as well.
EDIT2: Another thing that may or may not be relevant -> The custom target we're using does not have an external low freq clock (no LSE). We have an 8 MHz external clock, configured as the HSE, that we're using for the high freq stuff, but the low freq stuff is being done with the LSI. I think I've configured my target correctly with this:
But if I've made a mistake maybe that could be causing some troubles. |
As @kjbracey-arm stated the bootloader is supposed to do a soft reset back into the reset vector which should avoid any of this weirdness. |
How/Where would I confirm this? |
This makes me nervous - I've no experience with loading two ELF files simultaneously into GDB, and I'm not sure how you disambiguate. Some of your output would be consistent with you printing the content of the bootloader's Could you instead just load the ELF file for the main image into GDB?
No, but it is an RTOS build, which means the RTOS is getting initialised, and you're running your bootloader main as an RTOS thread. As of 5.12/5.13, the RTOS can be excluded via a
Well, it's not really a reset, it's just "manually safe hardware and jump to Reset_Handler". Where that fails is for any hardware that was disrupted by the bootloader and not restored, and assumed to be in reset state by the main image.
If using an up-to-date pyOCD for your connection, then a breakpoint on SVC_Handler should show you - the calling point would be shown as a different "thread" - there's 1 thread for handler mode, and 1 (or more) threads for process mode. If using a more archaic tool, then you'd have to dump the 8 words from SP while stopped on a SVC_Handler breakpoint - the return address for the SVC would be in the seventh word at SP+24. Show all 8 words in a dump, and that will tell us exactly what the supervisor call was. (SP+16 should be the address of an RTX function - look it up in the map). These really should not be happening. Anyway, all your memory maps look correct, so don't need to worry about them any more. But we still need to figure out what's going on with that 0x10 pointer that makes it abort - can you go around again with correctly matched ELF (main image only) and run through the boot? You won't have source references then for the bootloader bit, but that doesn't matter.
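The eight-word dump described above corresponds to the basic exception frame a Cortex-M core pushes on exception entry: R0-R3, R12, LR, return address, xPSR. The following host-side sketch labels such a dump so the fields can be read off directly. The type and function names are hypothetical, and the helper that extracts the SVC immediate assumes you have separately read the 16-bit Thumb opcode located at return address minus 2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Basic (no FPU state) Cortex-M exception frame, lowest address first. */
typedef struct {
    uint32_t r0, r1, r2, r3;   /* SP+0 .. SP+12 */
    uint32_t r12;              /* SP+16: RTX function address, per the comment above */
    uint32_t lr;               /* SP+20: caller's link register */
    uint32_t return_address;   /* SP+24: instruction after the SVC */
    uint32_t xpsr;             /* SP+28 */
} stacked_frame_t;

/* Label the eight words dumped from the stack pointer. */
stacked_frame_t decode_stacked_frame(const uint32_t words[8])
{
    stacked_frame_t f;
    assert(words != NULL);
    f.r0 = words[0];
    f.r1 = words[1];
    f.r2 = words[2];
    f.r3 = words[3];
    f.r12 = words[4];
    f.lr = words[5];
    f.return_address = words[6];
    f.xpsr = words[7];
    return f;
}

/*
 * Extract the SVC immediate from a 16-bit Thumb opcode (encoding 0xDFxx).
 * Returns -1 if the opcode is not an SVC instruction.
 */
int svc_number_from_opcode(uint16_t opcode)
{
    if ((opcode & 0xFF00u) != 0xDF00u) {
        return -1;
    }
    return (int)(opcode & 0x00FFu);
}
```

One caveat: the frame lives on whichever stack the caller was using. If the SVC was issued from thread mode running on the process stack, the words to dump are at PSP rather than at the handler's SP (bit 2 of the EXC_RETURN value in LR tells you which). The dump must also be taken at the first instruction of SVC_Handler, before the handler pushes anything itself.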
There isn't a very rigorous cleanup, tbh. I can imagine a situation where a peripheral is left active, then after entering the main image there's no driver (or init) code for it. Its interrupt generation should at least be masked by the There's no general system "shut everything down" notification to give things a chance to clean-up before the |
@kjbracey-arm Off topic but need to ask: |
In principle, no, there shouldn't be. In practice, maybe? In bigger systems, main images (like the Linux kernel) tend to be very paranoid about what their bootloader may or may not have done, so they tend to have init code that manually sets/resets pretty much everything on entry. The bootloader could be years behind the kernel, so they just do not trust "reset" hardware state. That does take code space though. It would be more space-efficient to trust the bootloader to put back everything it knows it touched, and have the main image just assume everything is in reset state. As we're currently in a situation where neither of those is really happening, I guess having bootloader and main image as close as possible does in practice minimise the chances of errors - such as this case potentially is. |
An ideal would be to have a chip that did support "real reset into secondary image". Have a register that the bootloader could write to that did "reset, but jump into this handler". Or maybe you could even do that in bootloader software - if you could reliably indicate "reset reason", then the bootloader itself could do I don't know if that approach has been attempted in Mbed OS - I've only seen the manual simulated-reset approach in |
My earlier comment about soft reset was under the assumption that Mbed-os was using reset-reason and potentially some other core register. I see now that is not the case. |
Thinking further, that would require some sort of industry standard between BL and application to use the core in that way. I'll make any further enquiries into this topic in a separate issue. |
@40Grit @kjbracey-arm Going to try the bare metal setting when I get back, but I'll be out until tuesday. Just FYI so you don't think I've given up haha |
Sorry for the delay, but I've returned and tried adding "requires bare-metal" to my
I'm still crashing inside of
where After this if I try to go forward (stepping), I end up at I'm confused with what's going on here. Why is it accessing code at EDIT: After adding the bootloader elf again to see what's going on, it seems to be accessing SVC_Handler code and just crashing as before:
|
@kjbracey-arm @40Grit Just for the sake of sanity, could either of you two reiterate the requirements for defining a custom target? I'm worried I may have missed a very basic step here and perhaps this is the source of the trouble here. I've read over this section from you guys a number of times but I want to be sure here. I'm specifically concerned about this section here:
Is my target definition for an MCU, Family, or Subfamily? Perhaps I'm missing one of these fields and that's what's causing this. |
I have no direct affiliation with ARM. I just work for one of their partners (Embedded Planet). Without watching the config, build, and debug session right in front of me, I'm starting to fish for ideas. Another sanity check could be to check the part's errata. What linker file does your build end up using? @kjbracey-arm might the image built for the nucleo board run in the processor that @DrynnBavis is using since it really only differs by package? |
I've been through this for the STM32F4xH already and nothing jumped out at me.
Pretty sure I'm using this one here |
After some more poking around, I found that the osKernelPreInit section of assembly code is simply two lines, first is a |
@DrynnBavis if your bootloader has no operational dependency on peripheral IO, I would see what happens if you flash the working binaries built for the nucleo board. |
I'm missing something. Apparently the bootloader and application work fine on the nucleo board. (f413zh) |
It's interesting that this target code feels the need to explicitly set VTOR for ROM anyway. Given that we've just reset, we surely must have entered our Reset Handler, so VTOR must be correctly set? The only issue is that the bootloader itself doesn't modify VTOR before jumping into the reset handler manually. That would be a reason to set VTOR in main image, but surely the bootloader should have done it itself... Edit: actually it does, |
According to whom? Whose bootloader? @DrynnBavis's? Looking at that code, I don't see how any bootloader would work - Mind you, I'm just asserting that by code inspection. No idea who's using a Nucleo F413ZH. |
But was that working by jumping into the bootloader and all the data happening to be in the same place, as it does on the custom board sometimes, if the bootloader is RTOS-based? |
I see on that issue confirmation that VTOR was correct (0x08020000) entering the image, but by reading the code, The moral of the story may be that |
dumb luck that nucleo is working then? Is this in ST's court now? |
If all this analysis is correct, @DrynnBavis can modify his own custom target's VTOR setup to match a working STM target. Or just take the ST should look at the inconsistency between their targets. |
I have a very similar issue with my custom bootloader based on the SAML21J18. I can use mbed-trace just fine; however, when I use SDBlockDevice the application fails to start. Should I create a new issue before I go into any more detail here? |
New issue please. Reference this one if you want. |
@kjbracey-arm or @40Grit (your time zone matches mine a little better), what code am I modifying here? Can you give me a file name and line number? Thanks |
I think the comment linked above should give a starting point. |
Somehow missed this on first read, thanks @loverdeg-ep. I've got to leave for something right now but I'll try this when I return in a few hours. |
@DrynnBavis and we actually are probably in the same timezone. I just get up super early and watch github. It is the only chance I have to get in contact with the experts in Oulu. |
YES. IT'S WORKING!!! @kjbracey-arm I tried using
This seems to have fixed my troubles for now. Will having this commented out cause any problems in the future? Curious to know why this works on the ZH chip but not the RH chip... To confirm: I did indeed copy |
What's the best protocol here in terms of closing this issue? Is there a PR I can make to give a better target definition for the RH chips? I'm also still interested in exactly why this solved the problem. So I uncommented that section (back to the faulty code) and set a break point right before this point:
So because Does this have anything to do with me using an LSI rather than an LSE? |
I've been through this type of exercise multiple times myself and had the Cortex-M user manual next to me when writing custom bootloaders. I go step by step through the assembly and watch the status of the registers. This usually gets me the type of understanding you are looking for with issues like these. Even so, I don't know the architecture well enough to diagnose this without spending a couple of hours on analysis. Figure out who is most active from ST in Mbed and copy them on this thread. Or open a ticket with ST and point them to this issue. However, it will probably fall to you to learn the "rule book" and definitively prove exactly why two parts which should? have the same memory map, programmed with the same binary, behave differently. The next step I would take is to prove that both parts have all the same peripherals mapped to all the same places in memory. If that is true, I would think the binary should work the same in both as long as neither part is relying on any external signals. |
There's no evidence there's any chip difference here, is there? The code sets VTOR wrong, but gets away with it if (a) there is no bootloader (so "wrong" happens to be right), or (b) the bootloader has an RTOS and the memory layout is the same (or similar enough). Presumably in all your ZH builds, you've ended up with images with matching RTOS memory layout. If adding/removing trace doesn't affect the ZH, then there's likely no deep meaningful significance - some difference in link order/padding means trace ends up not shifting |
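As I read the analysis above, the failure mode is that the main image ends up with VTOR pointing at the start of flash, so its exceptions are dispatched through the bootloader's vector table. A small host-side simulation of vector lookup may make that clearer. Everything here is illustrative: handler_for is a hypothetical helper, flash is modelled as an array of words, and the offsets used with it are scaled down rather than taken from the real images. The only hard facts used are the STM32F4 flash base address, the SVCall exception number, and that vector n is the word at VTOR + 4*n.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FLASH_BASE_ADDR   0x08000000u  /* STM32F4 flash base address */
#define SVCALL_EXCEPTION  11u          /* Cortex-M exception number of SVCall */

/*
 * Return the handler address the core would fetch for a given exception,
 * i.e. the word at VTOR + 4 * exception_number.
 * flash_words models flash contents starting at FLASH_BASE_ADDR.
 */
uint32_t handler_for(const uint32_t *flash_words, size_t n_words,
                     uint32_t vtor, uint32_t exception_number)
{
    size_t index;
    assert(flash_words != NULL);
    assert(vtor >= FLASH_BASE_ADDR && (vtor & 0x3u) == 0);
    index = (size_t)((vtor - FLASH_BASE_ADDR) / 4u) + exception_number;
    assert(index < n_words);
    return flash_words[index];
}
```

With a bootloader table at the flash base and a main-image table further up, calling handler_for with VTOR set to the flash base returns the bootloader's SVC_Handler even while the main image is running. That handler then operates on RTX data at the addresses the bootloader was linked with, which only works if both images happen to share the same RAM layout.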
Very behind on my work now since this issue came up, so first I'll have to work on that. But later this week, if I've the time then I'll definitely look into it. |
@DrynnBavis @kjbracey-arm - sorry, I was away when copied here and I missed it. Anyway, I agree that we probably have to clean things up a little bit now. To make things clear, does your custom target boot OK if you modify the system_clock.c file as below, rather than commenting out those lines:
|
In an ideal world, you shouldn't need to set The catch is that a bootloader might have entered your reset vector without adjusting VTOR. Our bootloader does set it, but others might not. So setting it seems reasonable. |
This implementation was introduced back in 2017 here: #3798 |
I've just had a chat with someone who is looking at booting into Mbed OS from another bootloader altogether, and he believes that it doesn't set VTOR, so better safe than sorry. (Mind you, it doesn't even set MSP or CONTROL correctly either, so you'd probably need to add more...). |
@kjbracey-arm Maybe this is naive but it would be nice if there were some standards surrounding this stuff. "CMSIS-boot" |
Well, if the bootloader is jumping into the vector table reset handler of a "raw" standalone image, then it's effectively avoiding having its own standard and using the "reset" standard. In which case it's kind of the bootloader's responsibility to have the chip in perfect reset state, right? So by that standard, we shouldn't have to write VTOR. But in practice, the main image is the malleable one while bootloaders tend to be locked in, so it's the main image that has to cope with whatever bootloaders do. |
The fact that they are "locked in" I'd say is more reason to develop a standard. That way an application could be better prepared to know what it is dealing with. I'll keep thinking on this one and find the right place to converse further. |
@LMESTM To confirm, no that macro |
For future readers that have scrolled this far looking for a solution: the fix for me was to comment out that entire chunk of code in my comment just above this inside of the Thank you everyone involved in debugging this. Quite the struggle but I'm really happy we actually came to a fix. I'm going to close this issue now as the problem has a working solution. Though I'll still be following here (or in another thread if we do that instead) |
Description
The following is some barebones code I'm using to just boot to the main app space at POST_APPLICATION_ADDR:
This code works, and I successfully boot from the bootloader into the main app space as expected. However, once I uncomment mbed_trace_init();, my app hangs after the bootloader and is unable to start the main app.
Using gdb, I've found that the fault seems to occur at some point within __scvKernelInitialize (). This is the disassembly printout of its definition:
From there, a few things happen inside of irq_cm4f.S before hitting an exception on line 93 of said file. After that, except.S is called, and from there a final fault message is seen in the debugger:
Question
What's causing the fault within irq_cm4f.S?
Issue request type
My custom target uses an STM32F413RH chip, but inside custom_targets.json I've specified the device name to be "device_name": "STM32F413ZH" because this has sector information filled out and is virtually the same processor but with a larger package / more pins.
Target: Custom target with STM32F413RH processor
Toolchain: GCC_ARM 8.2.1
Tool: mbed-cli
Vers: mbed-os 5.13