repeatable w1_therm module crash #872
Can you explain what you mean by "while the device is deleted from the system"? This sounds like a bug in the driver, which is something we take from the upstream kernel unmodified. Our resources are limited, so it is likely you will have to raise the bug with the authors and maintainers of the driver. |
OK, thank you for your answer, I will do it. If I "hot unplug" the temperature probe while I am reading or waiting on the sysfs file /sys/bus/w1/devices/28-xxxxxxxxxxxx/w1_slave (select/read), this module bug appears. I think the one-wire driver calls w1_therm_remove_slave before w1_slave_show terminates (I'm not a pro in driver development, but I assume this is, or is close to, the cause of the bug). Here are the two kernel functions of w1_therm which may cause the failure:
|
The module developer will patch the w1_therm.c file, and the bug is exactly what I was trying to explain. Simple solution: one mutex that blocks access to sl->family_data while w1_slave_show is running. |
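The mutex idea above can be modelled in user space. This is only a sketch in Python, not driver code: the `Slave` class, its `lock` field, the `started` event and the sample value 21500 are all hypothetical stand-ins, and the short sleep stands in for the conversion wait.

```python
import threading
import time


class Slave:
    """Hypothetical stand-in for struct w1_slave (not the kernel type)."""

    def __init__(self):
        self.family_data = {"temp_millic": 21500}  # made-up sample reading
        self.lock = threading.Lock()  # assumed per-slave mutex


def slave_show(sl, started=None, conversion_time=0.05):
    """Model of w1_slave_show: holds the lock for the whole read."""
    with sl.lock:
        if started is not None:
            started.set()  # tell the caller that the read holds the lock
        if sl.family_data is None:
            return None  # device already removed, nothing to read
        time.sleep(conversion_time)  # stands in for the conversion delay
        return sl.family_data["temp_millic"]


def remove_slave(sl):
    """Model of w1_therm_remove_slave: waits for a running read to finish."""
    with sl.lock:
        sl.family_data = None  # stands in for kfree(sl->family_data)


def demo():
    sl = Slave()
    results = []
    started = threading.Event()
    reader = threading.Thread(
        target=lambda: results.append(slave_show(sl, started))
    )
    reader.start()
    started.wait()    # the read is now in progress
    remove_slave(sl)  # blocks until the read releases the lock
    reader.join()
    results.append(slave_show(sl))  # a later read sees the removal
    return results
```

In this model the removal cannot free the data in the middle of a read, and a read that starts after the removal returns nothing instead of dereferencing freed memory.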
Good work. Obviously we'll take the patch when it appears. |
Was the patch applied? I can reproduce the bug in the 3.18.11+ kernel. And I am also tripping it in 4.0.7+ |
Thanks. Do you mean to cherry-pick a binary or cherry-pick the source code? (I don't have an environment set up for building from source, and I wonder if it would take many, many hours to do that since the kernel module is distributed with the kernel.) If I grab the binary I guess I would need to find an apt repository for the rc version. Is there one? |
No apology needed - I was suggesting that we could back-port the patch. In fact I've already done that, and I'm in the process of trying to reproduce the problem so I can verify that it is fixed before submitting a Pull Request. |
I can't reproduce this - I'm on 4.0.7 on a Pi 2. How critical is the timing? |
To attack from the other direction, here is the patched w1_therm.ko compiled for 4.0.7-v7+ (i.e. Pi 2). If somebody can verify that this no longer crashes that would be helpful. |
I'm not sure what to say about timing. Within two hours of booting up my Pi I usually get a kernel NULL pointer dereference. I don't have a Pi 2. So maybe it's predictable that this binary leads to the following message in kern.log:
I'm guessing the module_layout refers to compilation parameters and those are in the vermagic info which is different for the two versions of the compiled module:
Patched module for Pi 2:
If you compile it for Pi 1 (preempt ARMv6?) or show me how to do that then I'll give it a whirl. |
Pi 1 version here. |
Excellent. It's running now. I'll report back in a day or so if it runs without exceptions. |
No kernel faults yet |
I developed the "w1_therm reference count family data" patch. In the original submission I explained that this is a band-aid patch; the final solution is expected to involve switching to the sysfs reference counting. While there is still a possibility of a race, my tests easily crash the module without the patch, but not with it.
pelwell, this is my test. It has one loop manually adding and removing a device, and another trying to read that device. Set the variable in multiple shells (giving one of your slave IDs, or make one up), then run one loop in each shell, and after a while kill the add/remove loop. Other variations are to run multiple add/remove and/or temperature read loops, but in general the add/remove loops have to be killed, otherwise the reads get stuck and don't make progress.
slave=28-00000xxxxxxx
while true; do echo $slave > /sys/devices/w1_bus_master1/w1_master_add; sleep .1; echo $slave > /sys/devices/w1_bus_master1/w1_master_remove; sleep .1; done
while true; do time cat /sys/devices/w1_bus_master1/$slave/w1_slave ; sleep .1; done |
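The general idea of reference counting the family data can be sketched in user space as follows. This is a Python model of the technique only, not the patch itself: `FamilyData`, `get_ref`, `put_ref` and the sample value are hypothetical names chosen for illustration, and the real driver code may differ in every detail.

```python
import threading
import time


class FamilyData:
    """Hypothetical stand-in for the per-slave data the driver allocates."""

    def __init__(self, value):
        self.value = value
        self.refcnt = 1     # the slave itself holds one reference
        self.freed = False  # stands in for the memory having been kfree'd


class Slave:
    def __init__(self, value):
        self.cond = threading.Condition()
        self.family_data = FamilyData(value)


def get_ref(sl):
    """A reader takes a reference, or gets None if the slave is gone."""
    with sl.cond:
        fd = sl.family_data
        if fd is None:
            return None
        fd.refcnt += 1
        return fd


def put_ref(sl, fd):
    """A reader drops its reference and wakes a waiting remover."""
    with sl.cond:
        fd.refcnt -= 1
        sl.cond.notify_all()


def remove_slave(sl):
    """Detach the data, drop the slave's reference, free once unused."""
    with sl.cond:
        fd = sl.family_data
        sl.family_data = None  # new readers can no longer find the data
        fd.refcnt -= 1
        while fd.refcnt > 0:   # wait for in-flight readers to finish
            sl.cond.wait()
        fd.freed = True


def demo():
    sl = Slave(21500)
    fd = get_ref(sl)  # a read is in progress
    remover = threading.Thread(target=remove_slave, args=(sl,))
    remover.start()
    while True:       # wait until the remover has detached the data
        with sl.cond:
            if sl.family_data is None:
                break
        time.sleep(0.001)
    freed_while_reading = fd.freed  # the reader still holds a reference
    value = fd.value
    put_ref(sl, fd)   # the read finishes, the remover may now free
    remover.join()
    return freed_while_reading, value, fd.freed, get_ref(sl)
```

The point of the model is the ordering: the data is only marked freed after the last reader has dropped its reference, and readers arriving after the removal get nothing.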
Thanks for that, David. With your test and without the patches I was able to easily recreate the problem. With the patches I haven't managed after trying for over an hour. I've created PR #1059. If a better patch comes along we'll apply that instead, but for now this seems to solve the problem. |
After 2 more days I have no kernel faults. I declare this patch a success. |
Great. I've merged the patches into 4.0.y, and I'll apply them to 4.1.y when my testing is complete. |
As the patch is applied upstream, anything further would be on top of this patch. Don't look for it in the near term; no one has volunteered to implement the lower-level and more extensive changes Evgeniy Polyakov suggested. I don't even use the w1_therm module anyway. I'm reading the temperature sensors over netlink, which is a non-blocking, socket-like interface to the w1 driver. That lets my program send out 14 temperature conversion requests, delay, then issue 14 reads and collect the results, and be responsive to requests while it waits. With w1_therm every sysfs read blocks for the 750 ms+ it takes to do the conversion and read, so you either take 750 ms × 14 = 10.5 seconds to read the sensors serially one after another (and your program can't do anything else) or use multiple threads or processes, all of which is less efficient. |
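The arithmetic in the comment above can be written out as a short sketch. The 750 ms conversion time and the count of 14 sensors come from the comment; the `per_read_ms` parameter is an assumption added here to leave room for the cost of each individual read, which the comment does not quantify.

```python
CONVERSION_MS = 750  # lower bound per conversion, from the comment above


def serial_read_ms(sensors, conversion_ms=CONVERSION_MS):
    """Each blocking sysfs read waits for its own conversion."""
    return sensors * conversion_ms


def batched_read_ms(sensors, conversion_ms=CONVERSION_MS, per_read_ms=0):
    """Start every conversion, wait once, then read each result.

    per_read_ms is a hypothetical per-sensor read cost, 0 by default.
    """
    return conversion_ms + sensors * per_read_ms


print(serial_read_ms(14) / 1000)   # 10.5
print(batched_read_ms(14) / 1000)  # 0.75
```

Under these assumptions the batched approach is bounded by a single conversion delay plus the reads themselves, while the serial approach pays the full delay once per sensor.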
@jonalibert has your issue been resolved? If so, please close this issue. Thanks. |
I believe the fix for this is merged from my reading above. Closing. |
This bug is repeatable if you are waiting on or reading the device temperature file (for example /sys/bus/w1/devices/28-000006157bcd/w1_slave) while the device is removed from the system.
I think this kernel function is blocked (static ssize_t w1_slave_show(struct device *device, struct device_attribute *attr, char *buf)) while this function (static void w1_therm_remove_slave(struct w1_slave *sl)) calls kfree on sl->family_data. The blocked function then accesses sl->family_data without checking whether sl->family_data is NULL.
Reference to linux-rpi-3.18.y\drivers\w1\slaves\w1_therm.c.
uname -a
Linux raspberrypi 3.18.8+ #765 PREEMPT Thu Mar 5 15:41:59 GMT 2015 armv6l GNU/Linux
Here is the dmesg:
Cheers,
Jonathan ALIBERT