[Kernel5.4] Lowering arm_freq_min leads to system hang/crash #1431
Comments
I can reproduce this. The kernel backtraces seemed pretty random to me, so this is probably a clock/voltage issue rather than a kernel bug. |
You mean you "can" or you "can't" reproduce it? The issue is present with default clocks+voltage in my case, with only the minimum arm clock reduced and never ever any voltage warnings even when overclocked. We probably just found a second case with RPi Zero. Probably related as well: https://www.raspberrypi.org/forums/viewtopic.php?p=1685668#p1685668 |
I can see the crash on a Pi3+. I couldn't provoke it on a Pi4. Workaround for now is to disable the arm_freq_min. I'll let you know when it's safe to add back in. |
The only strange thing is that before 5.4 the same large clock jump was never an issue, and with the lowest and highest clocks as the only two pstates, the jump was always the largest possible. One could make a test with adjusting |
Had the same problem on a model 1b from 2013 that runs off a battery bank and solar. Don't remember why I had arm_freq_min set so low since it did not make a huge difference in power draw. When I updated yesterday to 5.4 and then rebooted, everything seemed to be fine. Ran top and within about 30 seconds the screen froze. I then used the default config.txt and after a reboot there was no more freezing when cpu load increased. Was going to continue troubleshooting today but checked the github issues and bingo, MichaIng saved me some time, thank you. |
May I express the urgency I see in resolving or working around this bug? This has the potential to destroy systems by causing file corruption in unconditionally crashed services, e.g. databases and similar. E.g. Let me know if there is anything I can test to help getting this resolved quickly. |
We have a workaround in latest rpi-update firmware that will disallow arm_freq_min below 600. |
Kind of a noob here: is this fix available now (using apt-update/upgrade)? |
No, you will need to use rpi-update. No schedule on apt. |
Ok thanks |
And remember that this is only a workaround for users who are not yet aware of the issue. In your case you simply |
So if I just comment out my arm_freq_min=300, it will go to the default 600, right? And I can change it back once a full update is out that fixes this? (I'm not using rpi-update, I'm waiting for apt, the warning scared me 🙂) |
Yes exactly. On RPi1+Zero it's 700 MHz but all defaults work fine. |
Maybe this is the wrong place to ask, but how do I know when a version is out that fixes this? |
Subscribe to this issue. I'm sure we'll get a dev notice once a real fix is merged. I'll keep an eye on it as well, search through the release commits when I recognise them, and post here in that case. |
There is a proper fix for this in an internal PR. I'll let you know when it reaches rpi-update. |
firmware: pi4: allow pllb changes while running See: #1431 firmware: board_info: Give the CUSTOM boards the PMIC_NCP6343 trait firmware: dispmanx/displays: Allow both DPI and DSI displays simultaneously firmware: imx477: Release the I2C semaphore once finished, not before
Latest rpi-update firmware contains a fix for this issue that doesn't involve limiting arm_freq_min. Please update and test. |
Jep seems to work fine. Just tested on RPi2 with
|
That's great! But as I understand, rpi-update is for pre-release stuff right? It will come to apt eventually? |
Yes, |
On Raspberry Pi 4,
Just a single case for now, but probably the same/similar underlying issue. |
I am doing some more tests, this time using AutoHotkey (ahk) to run the date command twice, 5 seconds apart.
I used the following ahk script:
#NoEnv ; Recommended for performance and compatibility with future AutoHotkey releases.
; #Warn ; Enable warnings to assist with detecting common errors.
SendMode Input ; Recommended for new scripts due to its superior speed and reliability.
SetWorkingDir %A_ScriptDir% ; Ensures a consistent starting directory.
#MaxThreadsPerHotkey 2
SetBatchLines -1
F17::
SendInput {Raw}vcgencmd measure_clock arm`n
;Wait for command output
Sleep, 500
SendInput {Raw}date +`%H:`%M:`%S:`%N`n
Sleep, 5000
SendInput {Raw}date +`%H:`%M:`%S:`%N`n |
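The macro above types `vcgencmd measure_clock arm` and two `date +%H:%M:%S:%N` calls into a terminal. A minimal Python sketch for evaluating that output follows. It assumes `vcgencmd measure_clock arm` prints a line of the form `frequency(48)=600000000` (value in Hz); the function names are hypothetical and not part of any Raspberry Pi tooling.

```python
import re


def parse_measure_clock(output: str) -> int:
    """Return the clock in Hz from a line like 'frequency(48)=600000000'."""
    match = re.search(r"frequency\(\d+\)=(\d+)", output)
    if match is None:
        raise ValueError(f"unexpected vcgencmd output: {output!r}")
    return int(match.group(1))


def parse_date_stamp(stamp: str) -> float:
    """Convert 'HH:MM:SS:NNNNNNNNN' (date +%H:%M:%S:%N) to seconds since midnight."""
    hours, minutes, seconds, nanos = (int(part) for part in stamp.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds + nanos / 1e9


def elapsed_seconds(first: str, second: str) -> float:
    """Seconds between two date stamps, tolerating a midnight rollover."""
    delta = parse_date_stamp(second) - parse_date_stamp(first)
    return delta % 86400
```

If the Pi's clock keeps correct time, `elapsed_seconds` on the two stamps should be close to the 5000 ms the macro sleeps, plus typing and command latency.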
Unfortunately, by now we have more than 4K rpi2 v1.2 and v1.1 units installed which are continuously having blocking problems and needing multiple reboots to restart. Please let me know if you need more information than what @MichaIng has already given you, which seems to me already very complete and exhaustive. This is a working rpi2:
You can see that with this config.txt and the older kernel RPI2 is able to work without any block at 350000 that is our required minimum frequency. I apologize to the readers, but I must express my disappointment: |
You can btw use the latest 4.19.x kernel, no need to stay with 4.9.x. The problem started with the intermediate frequency steps implemented with 5.4.x, while previous kernel versions only jump between min and max without any intermediate steps. Here is the latest commit which you should be able to use: https://github.com/Hexxeh/rpi-firmware/tree/866751bfd023e72bd96a8225cf567e03c334ecc4 rpi-update 866751bfd023e72bd96a8225cf567e03c334ecc4 or the latest packages for Raspbian Stretch:
With the current kernel, you can alternatively try to reduce voltage.
Not sure if it works as well as a 350 MHz idle frequency regarding power consumption/temperature, but it is worth a try. Another thing is the scaling governor. While |
Thanks Micha. Unfortunately this is a problem that we can also encounter in some versions of the 4.19.X kernel, and it is only thanks to you opening this issue that we understood that the problem could be related to the scaling governor and kernel version. Following your advice (before I saw your previous message), I tried to change from 4.9.35 to 4.19.118 (rpi-update e1050e94821a70b2e4c72b318d6c6c968552e9a2), but as soon as I did, I immediately had total system freezes with no lines written in any log. I must point out that:
I tried using -2 as overvoltage, but it resulted in no measurable reduction in working temperature. Then I tried -4 and saw a small temperature reduction, but maybe it is more related to a drop in the ambient temperature. I see that in a previous post you wrote that you managed to lower the frequency in a 5.4 commit on 24 August 2020 (#1431 (comment)). Do you think it's worth a try? Thank you! |
To be honest, I'm not sure whether this commit has even been released and tested in a production environment. The APT packages are probably a better choice then, as those have been and still are used on most production Raspbian Stretch systems. But if that 4.19.80 commit works stably now, it is probably not worth changing anything about that 🙂.
Okay that is oldoldstable and soon oldoldoldstable already (does this suite codename even exist? 😄). I think 4.19 kernels have never been tested on Jessie systems, so the freeze you mentioned might even be related to that, e.g. the old systemd (init system) version being related or so, not sure.
Especially for temperature with precision and error range it is of course difficult to measure significant changes without laboratory conditions, measuring the power consumption over a longer period might work better. But yes I agree that lowering the frequency definitely had an effect and as well allowed to further lower
Yes it worked there, when not reducing voltage, but test it thoroughly before applying to production as those commits are in the middle between stable releases. The latest commit which still allows to lower the frequency on RPi 0-3 is: rpi-update cc9ff6c7d1b9be5465c24c75941b049f94a6bd32 The next commit disabled it: https://github.com/Hexxeh/rpi-firmware/commits/ab9d6874ff67f7ef015d04358ad1e7711abe3f20 |
You are absolutely right. I don't need multistep scaling, a switch between min and max would be enough if it worked. It would still be better than nothing. Thanks @MichaIng for your support. |
I did further tests, starting from 4.9.35 and going up to 5.4.77, and found strange data. I hope this data will be useful to those of you with more expertise. For the following tests I raised and lowered the ambient temperature in order to reach the throttled and capped states. In config.txt: I have a script that every 5 seconds writes the following information to a file: These are the results with the official 4.9.35 kernel with no cpu load. It is as expected.
Starting with kernel 4.14.x we can see that the CPU temperature is on average 5 degrees lower than the Core temperature. Is it possible in the same chip? is it true?
At this point I noted a commit from @popcornmix (bdb826a8db75ba36d754bd71fb64d3905d3bd026) that has the following description (1st row):
Starting from this commit, as soon as the rpi2 goes into the throttled state (0x20002 or 0x60002), the ARM frequency goes crazy even if there is no cpu load:
There is no overclock, but ARM Freq is read as 1148000 kHz? Shouldn't the maximum be 900 MHz on rpi2? Here is the same with kernel 5.4.77 (here with arm_freq_min=300): it works as expected during ARM Capped (0x60006) or not throttled state (0x0, 0x60000 or 0x20000), but the ARM frequency goes crazy when it is in the throttled state (0x20002 or 0x60002). This throttled state seems to be where all my RPIs keep crashing (sometimes after 2 hrs, sometimes after 2 days)
At the moment I'm doing the last tests on 4.14.21 to 4.14.24 because I need a patch for sc16is7xx: Fix for multi-channel stall that was available from 4.14.21. I hope this data can be useful to someone. Thank you. |
Interesting find. Which CPU scheduling governor did you use for these tests? Note that only reading the temps/stats from those files and especially executing The 1148000 kHz at lower kernel versions indeed looks like a bug, but since kernels 5.4.77 and up do not show this behaviour, I guess there is no motivation to fix or even investigate this. Especially since I think vast parts of the scaling driver have been moved to the upstream implementation, hence the old code the commits refer to is likely gone completely. Finally, I'd love to have a testing branch with lowering |
Governor is still the default; I tried to keep everything as standard as possible. From what I understood:
at least this is how 4.9.35 behaves. I wrote 'gets "crazy"' because it rises for no apparent reason and the calls to vcgencmd are always the same (same load). The test of 5.4.77 was a one shot. It froze very soon, so I don't know if the frequency rises more than that, but it has the same "strange" behaviour when in throttled state. For vcgencmd get_throttled the bit map is the following:
Let me know if you need more data. |
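For readers who want to decode the `get_throttled` values quoted in this thread, here is a minimal Python sketch. It assumes the bit layout given in the official Raspberry Pi documentation for `vcgencmd get_throttled` and an output line of the form `throttled=0x20002`; the labels used by posters in this thread may differ from the documented names, and the function name is hypothetical.

```python
from typing import List

# Assumed bit layout, taken from the Raspberry Pi documentation for
# `vcgencmd get_throttled`; not confirmed by the posts in this thread.
THROTTLED_BITS = {
    0: "under-voltage detected",
    1: "ARM frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(output: str) -> List[str]:
    """Decode a line like 'throttled=0x20002' into the flags that are set."""
    value = int(output.strip().split("=")[-1], 16)
    return [
        label
        for bit, label in sorted(THROTTLED_BITS.items())
        if value & (1 << bit)
    ]
```

Logging the decoded flags next to the measured ARM clock makes it easier to see which state transition precedes a frequency jump or a freeze.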
Done some more testing. I can confirm that
This is the log using 4.14.24 (2659c9e87b574b3b05eacef80961c404ed0f0ce3), the last working: Here you can see how the ARM frequency behaves before, during and after a throttled state
In the following lines you can see what happens when you enter in Capped state.
@popcornmix What changed in this commit of yours regarding the frequency/voltage scaling logic? Attached you can find the complete day of this log with timestamps |
The commit message for the "Rework the frequency/voltage scaling logic" is:
and it is a rewrite that consisted of 35 commits. It's not something that can be trivially described. But I'm not sure what the value is in debugging a 3 year old version of the firmware. I know arm_freq cannot be reduced below 600MHz. That is an issue we don't currently have a solution for (without introducing instability), and so it is currently unsupported. |
Reading costs a lot less time than writing; you could have made a little effort, given all the time I'm wasting on it.
But if a problem in the frequency management of the cpu is not important to you... Okay. Unfortunately, with the 5.10.x kernel you have introduced another bug related to the SPI and the management of the SC16IS752 chip (cpu always at 100%), so at the moment it is not an option for us and it makes no sense for me to do more tests. This would be your job, not mine; given your careless answer, I think I have already wasted too much time on it.
Due to this other SPI bug THERE IS NO KERNEL we can use if above 4.19.24 !!!!!!!!!!!!!!!!!!!! BUT I see from your too hasty (and I think useless) answer that you don't worry about it, you don't care about it and instead of collaborating, you prefer to hide behind the formality of what is officially supported. Playing at your role game we need to remind you that we have purchased more than 4000 pieces almost exclusively for:
Your unilateral decision to remove this vital feature for us on March 31, 2021 (#1431 (comment)), without giving any prior notice and also clearly showing unwillingness to restore it, is clearly a violation of our rights as customers. At this time it is not possible for us to use the products that we have regularly purchased and paid for, without accepting important security risks, which is not acceptable. This reinforces my opinion that, as mentioned in a previous post that raspberrypi deleted, the Raspberry PI is not ready yet and is not intended for a market other than retro gamers or hobby projects, despite the time that has elapsed since RPI1. Have a nice day. |
It's about scale/gearing - there are many of you and very few of us. You are already diluting the utility of our time by requesting support for an obsolete version of our kernel, and you clearly know your own situation, so I don't consider asking for a concise restatement of a problem or requirement to be disrespectful. On the whole I think we do pretty well on the respect front, even with some of the more challenging members of the community (and I'm thinking of nobody in particular, and definitely not you). |
So, I have read this thread, and this is the precis I get (ignoring the SPI stuff - that seems to be a different issue and should have its own thread): Customer is using a 2B, and it appears to be right on the edge of acceptable thermals. Customer has reduced the temperature limit to 70degC, reason is unclear. To further reduce the temperature of the device installation they wish to reduce the min ARM frequency below the current 600 minimum, as this apparently reduced temperatures by up to 5degC. Unfortunately, reducing the minimum frequency is an unusual use case; the vast majority of users want more power, not less. AIUI, the current frequency management is targeted more at the majority, which seems the obvious choice given the limited HW support for all the frequencies required to be generated. I don't know what the options are here; to break the frequency management for most users to satisfy this use case seems counterproductive. I think it would be worth the customer retesting temperatures (if they haven't already) with the latest kernel and firmware, as it's certainly possible that the better management of the device with this combination may well give better results than dropping the ARM frequency. I'm not sure why the high temp limit has been reduced to 70; I would also try things with that removed. |
I urge everyone to stop blaming anyone else for anything else and to keep this thread productive. Development time is limited and given the outstandingly large user base of RPi vs any other SBC manufacturer, the Raspberry Pi foundation does an awesome job, obviously 🌞! Many other manufacturers do (nearly) no official kernel and/or userspace development, just to make clear what to compare with. Back on the actual issue:
Re-enabling an optional feature does not affect anyone else and does not "break the frequency management for most users".
I would love to, i.e. comparing power consumption and temperatures with and without ARM frequency lowered on non-RPi4 with latest kernel, but since the feature is disabled, I cannot 🤔. |
AIUI, the frequency management in the firmware has changed, a lot. As popcornmix states above, reintroducing the ability to go below 600 causes instability, for which we do not have a fix, without affecting the other frequency management. That's the point. Other stuff will suffer if this feature is reintroduced. The whole area of frequency management, thermal throttling etc is very complex, and made even more complicated because of the small number of PLLs available, and the large number of set frequencies we need to generate for the various peripherals. If it were as easy as "turn it back on", we would have done that. If it was as easy as "turn it back on and tweak a few things" we would also have done that! My point about testing was: does the latest firmware without the <600 feature match the power consumption of the previous firmware with the 600 dropped lower? That CAN be tested. In which case you don't need the <600 feature to match the consumption you had before. I have no idea if that will be the case, but the whole point of these management changes is to reduce power requirements overall. |
Thanks for clarification!
More than with the last 5.4 kernel which allowed it? Since with that one it was stable for me. But I guess the assessment of instability is based on more than my test results, shared further above that time. And is it so much different on RPi4, where it is still possible, despite the instability that can be in fact caused with it as well, as linked? If there is an obvious hardware limitation on older RPi models, regarding the number of PLL's, then we need to accept it. I mean of course with kernel 4.19 => 5.4 the major change was the implementation of the intermediate frequency states, having many new states especially with reduced min frequency.
Ah okay, above was stated:
Probably I find time to test this as well. But actually, even if the newer kernel would be more efficient elsewhere, this would not break the argument, as lowering power consumption and heat dissipation further would be still better. I wouldn't see this as user-specific question, whether it is "sufficient" or not as it is now, but it would be an enhancement in every case. But of course I cannot evaluate whether this can be achieved with acceptable effort and satisfying outcome overall. |
I believe there are more PLLs on the 2711, BUT there are also more peripherals to supply frequencies for. So there is still some juggling required to sort out all the clocks. The rule is generally: when you have n PLLs, the number you actually need is n+1. PLLs take up a lot of silicon though, which is why we always get n where n is too small. I'm not an expert on the link between the firmware and the Linux kernel frequency management, that's popcornmix, but the ultimate arbiter is the firmware, which has control as it needs to ensure temperatures don't get too high. |
All the data I produced here are related to tests conducted for this specific issue on the 5.4.x kernel ([Kernel5.4] Lowering arm_freq_min leads to system hang/crash), which I started testing 1 year ago. The throttling problem is related to this because during all my tests the 5.4.x firmware hang/crash (as well as for all the other kernels tested) seemed more related to the change of state of the throttling. For this reason I conducted more specific tests on those states and I noticed that there was a problem there, which is why I decided to report the collected data on this issue. I think that there is no need to have many frequencies, if this is the cause of the instability. SPI has nothing to do with it here; it's just to say I can't test 5.10.x for other reasons, but this issue is still for 5.4.x. Nothing against raspberry pi and I understand and appreciate all the reasons and merits, but I have to draw my conclusions and act accordingly. Thanks to all. |
Nothing to do with the problem, but in order to improve our support, it would be interesting to know why you feel you have been ignored, or you think replies have been hasty. Reading this thread, your posts have been replied to fairly swiftly by the right engineers, albeit not with the results you were hoping for. As for the hasty answers, I've also read those and whilst brief, they all appear to be accurate. Our engineers are spread very thin, that means replies can be brief as we have a lot of work to do. So would be interested in the reasoning if you have time to comment. |
Thank you for the question even though I think we bore everyone else here, it is difficult to answer this question without entering into controversy. I did not understand which of the answers you are referring to, they probably answered quickly, but an answer should first be reasoned and then also filled with content. After all the tests I have carried out (1 year) I stated in my messages, with absolute certainty, that:
But the answer @popcornmix gave me clearly demonstrates that he considers the problem too unimportant to take the time to investigate the matter, and (as if to tease me) also asks me: "Can you explain simply what the problem is with the latest firmware / kernel?" forgetting that this issue is about 5.4.x and the issue is always the title of this issue. All the answers received (except the one from @MichaIng) were limited to highlighting my inaccuracies (as ignorant as I am), but no real answer, concrete solution or at least a willingness to act was given, apart from something that sounds like: "we have blocked it, now the problem is not there, so don't bother us anymore". Some of the reasons I read are even more worrying. So in the end, with the answers I received from you, I realized that:
I apologize for my bad English, I hope I have been able to explain to you the reason for my frustration. Have a nice day. |
I still think it makes sense to split the issue up, as the collected information refers to different kernel versions and different RPi models, some related to reduced idle frequency, others not. Sure, they could be related, but it might be easier to investigate each issue in isolation, to allow engineers to better reproduce it or get targeted debug results. My suggestion would be:
Does this make sense? And to help your particular case @TheyKilledKenny, did you go through all other possibilities to disable hardware features and lower power consumption (== heat dissipation) by this? As a related question came up, I collected things I applied on my headless RPi2 here: #1577 (comment) |
@MichaIng I could retest in a week. Personal constraints make it currently impossible for me. |
Thank you very much. Thank you. FYI in my particular use case, not important for this issue: I have already tried your valuable suggestions including undervoltage, but even if I have achieved some small results, the only real way that gives concrete results to keep the temperature low is to lower the working frequency. I'm not a fan of under/overvoltage. Working with many pieces, I believe that components should work at the specified voltage to avoid possible fluctuations in manufacturing error tolerances, but I tried that too. Really thank you. |
That is a valid concern indeed, one I hadn't thought about. I'm pretty sure the stable lower voltage limit depends on the individual device, and actual issues of too low voltage can include randomly occurring data loss (especially on USB drives), so that needs to be tested and at times monitored on the single system, not really applicable on a farm. |
Describe the bug
I was upgrading to the newest firmware + kernel packages, which resulted in system hangs and/or crashes. I narrowed down the issue to arm_freq_min, which I lowered to 150 or 300 (tested both) to allow the system to clock below 600 MHz. Commenting out the setting leads to a stable system; setting/reducing it leads to a quickly hanging or crashing system.
To reproduce
Raspberry Pi 2 Model B Rev 1.1 to current package release 5.4.51-v7+.
arm_freq_min to 300 (gpu_mem=16, if relevant)
vcgencmd measure_clock gpu.
Expected behaviour
Add a clear and concise description of what you expected to happen.
Actual behaviour
Setting arm_freq_min to 300 should not lead to system crashes.
System
Copy and paste the results of the raspinfo command in to this section. Alternatively, copy and paste a pastebin link, or add answers to the following questions:
Raspberry Pi 2 Model B Rev 1.1
cat /etc/rpi-issue: Raspbian GNU/Linux bullseye/sid
vcgencmd version: version 21a15cb094f41c7506ad65d2cb9b29c550693057 (clean) (release) (start_cd)
uname -a
Logs
Additional context
This is new and probably the reason for the crashes when lowering minimum frequency. When leaving at 600, there are only two pstates 600 and 900 and with kernel 4.19 there are always only two.
I was actually hoping for that feature, so great work, however sadly at least my RPi model does not work fine with it.
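To check which pstates a given kernel exposes, the frequency table can be read from sysfs. A minimal Python sketch, assuming the cpufreq driver provides the standard `scaling_available_frequencies` attribute (values in kHz, space separated); the function names are hypothetical.

```python
from pathlib import Path
from typing import List

# Standard cpufreq sysfs directory for the first CPU core.
CPUFREQ_DIR = Path("/sys/devices/system/cpu/cpu0/cpufreq")


def parse_available_frequencies(text: str) -> List[int]:
    """Parse space-separated kHz values into a sorted list of MHz values."""
    return sorted(int(field) // 1000 for field in text.split())


def read_available_frequencies() -> List[int]:
    """Read the pstates exposed by the running kernel, in MHz (run on the Pi)."""
    text = (CPUFREQ_DIR / "scaling_available_frequencies").read_text()
    return parse_available_frequencies(text)
```

With arm_freq_min left at 600 on this RPi2, the description above implies the result would be the two pstates 600 and 900; a lowered arm_freq_min on kernel 5.4 would add further entries.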