BUG: Bad page map in process / BUG: KASAN: wild-memory-access; suspected DMA issue #5138
Comments
Interesting - could this be due to the same underlying issue as #5133? |
I've read this entire issue and the logs, and I couldn't find anything that relates to #5133. I'm not entirely sure how the kernel is structured for the Pi, so don't take my words as 100% correct, but this doesn't seem related to me: my logs were almost exactly the same on two occasions, while here the structure of the logs is vastly different. I could be wrong and the two issues could be related; I just couldn't find any error messages in common with my issue. Moreover, the steps to reproduce this issue include disabling swap, whereas I have been using 2GB of swap for months, since long before either issue appeared. |
The common factors are "camera" and "memory corruption". |
Right, this could be related. As I said, don't take my word for it; I just wanted to show my perspective, but those 2 issues could be directly related as well. I have also updated my issue with additional data. |
I set up a third device, CM4, Kernel 5.19.1, reproducer does not take long. Can I do something to produce more meaningful debug traces?
|
I think |
No, olokos. Spamming these issues with megabytes of logs won't get us anywhere. I am able to find stacktraces, and I can annotate the line numbers because I compile the kernel myself. My issue has a working reproducer. What I was asking was if there was a hint to produce better logs by enabling more kernel debug options, i.e. I tried the I will wait for someone from the kernel team to hear if they can reproduce. |
We're still in holiday season, but this issue will get some attention in the not too distant future. |
@kralo I am not spamming issues with logs; on the contrary, I mostly attached files directly to the comment, so there is no wall of logs when you open my issue. I also often use hastebin/pastebin to avoid cluttering the issue page; there are many options out there! ;) You asked for a hint on producing better logs, but you never provided logs of what happened beforehand, which I specifically asked for. That makes it much more difficult to compare your logs to mine. On top of that, your log doesn't seem to be typical dmesg output, a format that IMO makes it easier to see the time differences between calls. We're using the same Pi and we're both connecting the camera directly to the Pi with flex tape, except that you use vivid and I use libcamera + libcamera-apps, but they most likely still use similar kernel functions to operate. Sometimes a stack trace is a direct result of what happened before, which you have chosen not to include, for reasons unknown to me. The only reason I'm even commenting here is that Pelwell pointed out that this issue might be related to mine, and all I want is for my camera stream to be very stable and reliable; otherwise I wouldn't write this comment. |
I compiled the kernel with
As my Application (and my reproducer, willfully) do, they "stall" the buffer pipeline (by In bcm2835-unicam.c, there is this
This buffer, which on my RPi4CM is 4K, is where the CSI-Frame-Handler's write address gets directed to, when there is no (vb2-)buf ready. But will the
I could not find clear documentation about this. Someone with access to more verbose videocore iv specification should check this. As a quick test, I have increased the size of the dummy buffer, by Out-of-bounds write by DMA would be such a nice explanation... |
That's a useful finding. It would be even better if you could initialise the start of the second page of the dummy buffer with a fixed pattern, and check periodically that it hasn't changed - something like this:
diff --git a/drivers/media/platform/bcm2835/bcm2835-unicam.c b/drivers/media/platform/bcm2835/bcm2835-unicam.c
index cad7f018b221d..a3c434e74900a 100644
--- a/drivers/media/platform/bcm2835/bcm2835-unicam.c
+++ b/drivers/media/platform/bcm2835/bcm2835-unicam.c
@@ -131,6 +131,11 @@ MODULE_PARM_DESC(media_controller, "Use media controller API");
* allocation works in units of page sizes.
*/
#define DUMMY_BUF_SIZE (PAGE_SIZE)
+#define PADDED_DUMMY_BUF_SIZE (PAGE_SIZE * 230)
+
+static const uint32_t guard_pattern[] = {
+ 0x5a5a1234, 0xffffffff, 0xffffffff, 0x5678a5a5
+};
enum pad_types {
IMAGE_PAD,
@@ -852,6 +857,12 @@ static void unicam_schedule_dummy_buffer(struct unicam_node *node)
static void unicam_process_buffer_complete(struct unicam_node *node,
unsigned int sequence)
{
+ if (memcmp(node->dummy_buf_cpu_addr + DUMMY_BUF_SIZE, guard_pattern, sizeof(guard_pattern)))
+ {
+ pr_err("%s: guard pattern corrupted:\n%*phN\n", __func__,
+ (int)sizeof(guard_pattern), node->dummy_buf_cpu_addr + DUMMY_BUF_SIZE);
+ memcpy(node->dummy_buf_cpu_addr + DUMMY_BUF_SIZE, guard_pattern, sizeof(guard_pattern));
+ }
node->cur_frm->vb.field = node->m_fmt.field;
node->cur_frm->vb.sequence = sequence;
@@ -3004,13 +3015,16 @@ static int register_node(struct unicam_device *unicam, struct unicam_node *node,
media_entity_pads_init(&vdev->entity, 1, &node->pad);
node->dummy_buf_cpu_addr = dma_alloc_coherent(&unicam->pdev->dev,
- DUMMY_BUF_SIZE,
+ PADDED_DUMMY_BUF_SIZE,
&node->dummy_buf_dma_addr,
GFP_KERNEL);
if (!node->dummy_buf_cpu_addr) {
unicam_err(unicam, "Unable to allocate dummy buffer.\n");
return -ENOMEM;
}
+
+ memcpy(node->dummy_buf_cpu_addr + DUMMY_BUF_SIZE, guard_pattern, sizeof(guard_pattern));
+
if (!unicam->mc_api) {
if (pad_id == METADATA_PAD ||
!v4l2_subdev_has_op(unicam->sensor, video, s_std)) {
@@ -3104,7 +3118,7 @@ static void unregister_nodes(struct unicam_device *unicam)
struct unicam_node *node = &unicam->node[i];
if (node->dummy_buf_cpu_addr) {
- dma_free_coherent(&unicam->pdev->dev, DUMMY_BUF_SIZE,
+ dma_free_coherent(&unicam->pdev->dev, PADDED_DUMMY_BUF_SIZE,
node->dummy_buf_cpu_addr,
node->dummy_buf_dma_addr);
} |
[ Patch updated - it was missing the |
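The guard-word idea in the patch above can be tried out in user space as well. What follows is a minimal sketch, not the driver code: the helper names `guard_arm` and `guard_intact` are hypothetical, and an ordinary byte array stands in for the DMA-coherent dummy buffer. Only the pattern values are taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same guard words as in the patch above. */
static const uint32_t guard_pattern[] = {
    0x5a5a1234, 0xffffffff, 0xffffffff, 0x5678a5a5
};

/*
 * Hypothetical helper: place the guard words directly after the first
 * `usable` bytes. The caller must have allocated at least
 * usable + sizeof(guard_pattern) bytes.
 */
static void guard_arm(uint8_t *buf, size_t usable)
{
    assert(buf != NULL);
    memcpy(buf + usable, guard_pattern, sizeof(guard_pattern));
}

/*
 * Hypothetical helper: return 1 while the guard words are untouched and
 * 0 once something has written past `usable`. memcmp() returns 0 when
 * the two regions are equal, so "corrupted" means memcmp() != 0.
 */
static int guard_intact(const uint8_t *buf, size_t usable)
{
    assert(buf != NULL);
    return memcmp(buf + usable, guard_pattern, sizeof(guard_pattern)) == 0;
}
```

A caller would arm the guard once after allocation and then call `guard_intact()` at a convenient periodic point, re-arming after a hit so that later overruns are also reported.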
IIRC The buffer end address in Unicam was checked at the end of a line. @naushir is out of the office until Tuesday, but we'll discuss it then. It should be noted that the dummy buffer will only ever be written to in the event of a dropped frame due to buffers not being cycled correctly / fast enough. |
Some of these call stacks look suspiciously similar to those in raspberrypi/rpicam-apps#246. For that issue there was a suspicion it was power related, and using
This should be the expected behaviour of the hardware, according to the spec. Clearly this may not be true based on your observations, so there may be some hardware bug that is in play. Note that the dummy buffer will be used when there are frame drops occurring. This is not entirely uncommon, so the overwrite does not seem to be happening on all occasions or we would see this much more often. Perhaps there's some other interaction when the system/memory bus is heavily loaded?
I would have hoped that the Unicam HW checks the buffer overwrite possibility on every AXI burst operation rather than just at the line boundary, but there is no way to confirm this. If that were the case, the image width would not matter for sizing the dummy buffer.
Assuming the HW has a bug, this feels more like the right solution here - maybe even size the buffer |
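To get a feel for why the image width would matter if the end address were only checked at line boundaries, here is a small arithmetic sketch. It models an assumption discussed in this thread, not confirmed hardware behaviour: any line that starts below the end address is written in full. The function name `line_overrun_bytes` and the example strides (1280 bytes for an 8-bit 1280-pixel line, 1600 bytes for a 10-bit packed one) are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model (an assumption, not a documented Unicam property):
 * the end address is compared only before a line starts, so a line that
 * begins below buf_size is written completely.
 *
 * Returns the number of bytes that would land beyond buf_size in the
 * worst case for a given line stride in bytes.
 */
static size_t line_overrun_bytes(size_t buf_size, size_t stride)
{
    assert(stride > 0);

    /* Bytes of the last started line that still fit in the buffer. */
    size_t rem = buf_size % stride;

    /* If the buffer is an exact multiple of the stride, nothing spills. */
    return rem ? stride - rem : 0;
}
```

Under this model, a 4096-byte dummy buffer and 1280-byte lines give lines starting at 0, 1280, 2560 and 3840; the last one ends at 5120, so `line_overrun_bytes(4096, 1280)` is 1024. Padding the allocation by at least one line stride would contain the spill in this model.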
Predictably, I cannot reproduce the crash (in the 4-5 hours it has been running so far) with the rpidmareproducer script! However, I am running a very different configuration base:
I also have |
Right, so I definitely see the overruns in the dummy buffer occurring when adding/testing guard-words on the end of the buffer. I have not reproduced the overrun with libcamera, only with the provided The good news is that doubling the size of the dummy buffer and providing Unicam with half that size seems to "fix" the overrun - i.e. overruns still occur, but we do not trample outside of our allocation. @kralo would you be able to try out this change from #5157 and let me know if you still get crashes with your setup? |
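The "allocate double, advertise half" approach described above can be illustrated with a toy simulation. This is a sketch under the same unconfirmed assumption raised earlier in the thread, namely that the writer only checks the end address before starting a line. The function `sim_write_frame` and the fill value are hypothetical, and a plain array stands in for the DMA buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Toy model of the suspected behaviour: a line is written in full as
 * long as it starts below `advertised`, the size the "hardware" was
 * told about. `allocated` is the real size of buf.
 *
 * Returns the offset one past the last byte written.
 */
static size_t sim_write_frame(uint8_t *buf, size_t allocated,
                              size_t advertised, size_t stride,
                              size_t lines)
{
    size_t pos = 0;

    for (size_t i = 0; i < lines && pos < advertised; i++) {
        /* In this simulation a write outside the allocation is a bug. */
        assert(pos + stride <= allocated);
        memset(buf + pos, 0x10, stride);
        pos += stride;
    }
    return pos;
}
```

With `allocated = 8192`, `advertised = 4096` and a 1280-byte stride, the simulated writer stops at offset 5120: past the advertised size, but still inside the allocation. With `allocated = 4096` the same call would trip the assertion, which mirrors the out-of-bounds write suspected in this issue.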
I can reproduce the errors when checking with the guard patterns, this was an excellent idea. However, to anyone trying this I recommend checking for (memcmp () != 0)
This yields
which clearly is camera pixel data, as my ov9281 is set to generate the test pattern, i.e. So I will now run the test for a couple of hours and report in the pull request. |
Thanks for the memcmp correction (the code wasn't even compiled) - I've updated the code fragment to avoid confusion. |
Describe the bug
I have an application that does image analysis from an ov9281 sensor. Sometimes it would hang and produce traces like the following.
I then tried to isolate the problem and wrote a reproducer (below).
This happens on Kernels 5.10.110 and 6.0.0-rc1.
I suspect this has to do with the DMA/vc_sm_cma part of the camera image acquisition, because it does not happen with a "virtual" camera from the vivid driver.
Very often the bad page map is around pmd:800000001801003. The PTE always seems to be different.
With the reproducer, crashes happen around 1-3 times/hour. With the vivid driver it ran for 10 hours straight, at which point I terminated it for lack of patience.
Steps to reproduce the behaviour
I have the suspicion, that this is more easily triggered when memory is tight, so
dtoverlay=ov9281
, remove kernel security/memory address "fog",
gcc -o rpidmareproducer rpidmareproducer.c
rpidmareproducer.c.txt
NB: I have blacklisted,
rpivid_hevc, bcm2835_codec, bcm2835-isp, bcm2835_v4l2, bcm2835_mmal_vchiq
thus, I do not suspect the issue to be there.
I have found that I can best reproduce this using boot loops, so I autostart the reproducer (".config/autostart/repro.desktop")
The reproducer works with 1280x720 as the image format, which can also be supplied by the vivid driver. Execute on every reboot to set the camera correctly:
What the reproducer does:
It starts streaming from the camera and sometimes executes a syscall . It seems that this is when the system tries to copy pages and fails.
watch -n 15 shutdown -c
to cancel all occurring reboot requests.)
My kernels are compiled from the rpi repo with additional debug options, e.g. KASAN. To use the vivid driver, enable kernel option
CONFIG_VIDEO_VIVID=m.
If you want to see the reproducer run for hours without issue, remove the dtoverlay, and
modprobe vivid.
Device(s)
Raspberry Pi 4 Mod. B, Raspberry Pi CM4 Lite
System
Raspberry Pi reference 2022-04-04
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 27a8050c3c06e567c794620394a8c2d74262a516, stage2
Aug 9 2022 13:44:40
Copyright (c) 2012 Broadcom
version 273b410636cf8854ca35af91fd738a3d5f8b39b6 (clean) (release) (start)
5.10.110-v8-g2d80ef99861c SMP PREEMPT Fri Aug 19 09:08:48 UTC 2022 aarch64 GNU/Linux
6.0.0-rc1-v8-gc8f41281d1f4
More info in raspinfo.txt
Logs
These redzone-overwritten messages hint at something being wrong in the memory code:
Left Redzone, CM4 , 6.0.0-rc1
KASAN null-ptr-deref, CM4, 5.10.110
Bad page map, RPI4 B, 6.0.0-rc1
RPI 4B, 6.0.0-rc1, page allocation failure
Additional context
+cc @naushir @pelwell