Polling /sys/class/thermal/thermal_zone0/temp eventually results in kernel backtrace #132
Reproducible with latest firmware and kernel:
Just run the following test:
And you should see the processes all hang within 20 seconds, and the following stacktrace will appear in syslog/kern.log after a few minutes:
|
Ah, I see an issue. I think we need a mutex around bcm_mailbox_property. |
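The idea of serializing access can be illustrated with a user-space analogy. This is only a sketch, not the driver's code: `FakeMailbox`, its `property` method, and the single shared buffer are hypothetical stand-ins, on the assumption that the real function reuses one buffer for request and response. Holding a lock across the whole exchange stops concurrent callers from overwriting each other's data.

```python
import threading
import time


class FakeMailbox:
    """Hypothetical stand-in for a mailbox with one shared buffer."""

    def __init__(self):
        self._buffer = None              # shared between all callers
        self._lock = threading.Lock()    # serializes whole transactions

    def property(self, tag):
        # Hold the lock across write + wait + read so that no other
        # caller can overwrite the buffer mid-transaction.
        with self._lock:
            self._buffer = tag           # write the request
            time.sleep(0.001)            # simulate waiting for the reply
            return self._buffer          # read the response


def run_callers(mailbox, n_threads=4, n_calls=50):
    """Return how many calls got a response that did not match the request."""
    mismatches = []

    def worker(ident):
        for i in range(n_calls):
            tag = (ident, i)
            if mailbox.property(tag) != tag:
                mismatches.append(tag)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(mismatches)


if __name__ == "__main__":
    print("mismatched responses:", run_callers(FakeMailbox()))
```

Removing the `with self._lock:` line in this toy model lets one thread's request overwrite another's before the response is read, which is the kind of clash a mutex is meant to prevent.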
Update is pushed out. Use rpi-update and test please. |
Out of interest, will this affect me doing mailbox execute? My calls are all synchronous (from my POV) but that doesn't mean I can't fall foul of a mutex ;-) |
Thanks Dom. Hopefully I've picked up the right version:
4 concurrent read loops seem to be progressing nicely, so I think this can be closed. However, I should add that I still notice the occasional bogus value being returned, e.g.:
Both the second and third values are clearly not valid. Any idea if that needs to be looked at? I see these bogus values even with a single loop; for example, when starting a second ssh window, you'll see a bogus value appear in the window running the loop. |
The bogus values are a different issue to the hang. The currently disabled message "Failed to get temperature" does get output when we get a spurious value. I could add a retry, which would avoid the problem, but I'd prefer to understand it... |
Sure, understood - I'll open a separate issue for the bogus value just so that it's on the list, though it's a very minor problem and easily worked around with a retry until a sane value is acquired. I've been running the 4-loop torture test for about 4.5 hours now, and there have been no hangs and syslog/kern.log is clean, so I'm going to close this issue as fixed. Many thanks for the quick resolution. |
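The retry workaround mentioned above might look like the following user-space sketch. The function names and the plausibility bounds (0 to 120 °C, expressed in millidegrees on the assumption that the sysfs file reports millidegrees Celsius) are illustrative choices, not anything taken from the driver.

```python
def read_sysfs_temp(path="/sys/class/thermal/thermal_zone0/temp"):
    """Read one raw value from the thermal zone file."""
    with open(path) as f:
        return int(f.read().strip())


def read_temp_with_retry(read_raw=read_sysfs_temp, max_attempts=5,
                         lo=0, hi=120_000):
    """Call read_raw() until it returns a plausible value.

    lo and hi are assumed sanity bounds in millidegrees Celsius;
    adjust them to whatever counts as "sane" on your system.
    """
    for _ in range(max_attempts):
        value = read_raw()
        if lo <= value <= hi:
            return value
    raise RuntimeError(
        f"no plausible temperature after {max_attempts} attempts")
```

Passing the reader in as a parameter keeps the retry logic separate from the file access, so it can be exercised without a Pi. Note that a retry only hides the spurious values; it does not explain them.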
@simonjhall |
I use your ioctl + IOCTL_MBOX_PROPERTY system, called from user code. |
Yes, that makes use of bcm_mailbox_property function. |
Oooh. I run lots of VPU jobs, and I could imagine that this could clash if someone had a CPU temp display on their desktop or something. Also, what about the clock speed freq system? |
Yes, on demand cpufreq scheduler will cause mailbox property writes. |
The fix from Jan 7 doesn't solve the problem. Most likely it only hides it temporarily. |
@stupid-boy |
I see this lockup when attempting to read the core temp regularly, since I'm trying to use RPIMonitor... same kernel messages as above; this particular Pi hosed at around midday today, even though it wasn't yet running RPIMonitor at that point, so I don't think anything would have been reading temp, but I do have the 'on demand' CPU governor running... |
Link to Forum Discussion thread
Repeatedly reading/polling /sys/class/thermal/thermal_zone0/temp will eventually result in a kernel backtrace, and the process attempting to read the temperature will hang. This can occur after a couple of days when polling /sys/class/thermal/thermal_zone0/temp every two seconds, or sooner if polling more frequently.
With the following Raspbian 512MB system:
I ran the following test, commencing Jan 6 17:06
On Jan 7 at 01:48:46, the process became hung:
In syslog and kern.log, the following corresponding entries were found:
In addition, when running two (or more) concurrent tests, the values returned by the thermal driver are often invalid (i.e. large, negative values). The likelihood of invalid values being returned by the thermal driver increases with the number of concurrently running tests. This might suggest there is an underlying concurrency issue in the thermal driver.
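A concurrent test of this kind could be sketched as below. This is not the original test script: the reader count, iteration count, and plausibility bounds (0 to 120 °C in millidegrees) are assumptions chosen for illustration. Each thread polls the file and the main thread counts how many readings were missing or implausible.

```python
import threading
import time

TEMP_PATH = "/sys/class/thermal/thermal_zone0/temp"


def is_plausible(millideg, lo=0, hi=120_000):
    """Assumed sanity bounds; large negative values fail this check."""
    return lo <= millideg <= hi


def poll(path, count, interval, results, lock):
    """Read the file `count` times, recording each value (None on error)."""
    for _ in range(count):
        try:
            with open(path) as f:
                value = int(f.read().strip())
        except (OSError, ValueError):
            value = None
        with lock:
            results.append(value)
        time.sleep(interval)


def run_concurrent(path=TEMP_PATH, readers=4, count=100, interval=0.0):
    """Return (total readings, invalid readings) across all reader threads."""
    results = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=poll,
                         args=(path, count, interval, results, lock))
        for _ in range(readers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    invalid = sum(1 for v in results if v is None or not is_plausible(v))
    return len(results), invalid


if __name__ == "__main__":
    total, invalid = run_concurrent()
    print(f"{invalid} invalid out of {total} readings")
```

Running it with increasing `readers` values is one way to check whether the share of invalid readings grows with concurrency. On an affected kernel a reader may hang inside the read, in which case the script will not return.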