Skip to content

BUG: Bad page map in process / BUG: KASAN: wild-memory-access; suspected DMA issue #5138

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
kralo opened this issue Aug 22, 2022 · 18 comments
Closed

Comments

@kralo
Copy link
Contributor

kralo commented Aug 22, 2022

Describe the bug

I have an application that does image analysis from an ov9281 sensor. Sometimes it would hang and produce traces like the following.

I have then tried to isolate and written a reproducer (below).
This happens on Kernels 5.10.110 and 6.0.0-rc1.

I suspect this has to do with the DMA/vc_sm_cma part of the camera image aquisition, because it does not happen with a "virtual" camera from the vivid driver

Very often the bad page map is around pmd:800000001801003 . PTE does always seem to be different.

The crashes with the reproducer happen around 1-3 times/hour, with the vivid driver it runs for 10 hours straight, when I terminated due to lack of patience.

Aug 21 17:25:41 kernel: BUG: Bad page map in process Xorg  pte:80706090a070705 pmd:800000001801003
Aug 21 17:25:41 kernel: addr:0000005599400000 vm_flags:200100073 anon_vma:ffffff804a2f20d8 mapping:0000000000000000 index:5599400
Aug 21 17:25:41 kernel: file:(null) fault:0x0 mmap:0x0 read_folio:0x0
Aug 21 17:25:41 kernel: CPU: 0 PID: 466 Comm: Xorg Tainted: G         C         6.0.0-rc1-v8-gc8f41281d1f4 #7
Aug 21 17:25:41 kernel: Hardware name: Raspberry Pi Compute Module 4 Rev 1.0 (DT)
Aug 21 17:25:41 kernel: Call trace:
Aug 21 17:25:41 kernel: dump_backtrace.part.0 (arch/arm64/kernel/stacktrace.c:184) 
Aug 21 17:25:41 kernel: show_stack (arch/arm64/kernel/stacktrace.c:191) 
Aug 21 17:25:41 kernel: dump_stack_lvl (lib/dump_stack.c:107 (discriminator 4)) 
Aug 21 17:25:41 kernel: dump_stack (lib/dump_stack.c:114) 
Aug 21 17:25:41 kernel: print_bad_pte (mm/memory.c:567 (discriminator 12)) 
Aug 21 17:25:41 kernel: vm_normal_page (mm/memory.c:638) 
Aug 21 17:25:41 kernel: do_wp_page (mm/memory.c:3384) 
Aug 21 17:25:41 kernel: __handle_mm_fault (mm/memory.c:4929 mm/memory.c:5053) 
Aug 21 17:25:41 kernel: handle_mm_fault (mm/memory.c:5151) 
Aug 21 17:25:41 kernel: do_page_fault (arch/arm64/mm/fault.c:502 arch/arm64/mm/fault.c:602) 
Aug 21 17:25:41 kernel: do_translation_fault (arch/arm64/mm/fault.c:689) 
Aug 21 17:25:41 kernel: do_mem_abort (arch/arm64/mm/fault.c:818) 
Aug 21 17:25:41 kernel: el0_da (arch/arm64/kernel/entry-common.c:133 arch/arm64/kernel/entry-common.c:142 arch/arm64/kernel/entry-common.c:503) 
Aug 21 17:25:41 kernel: el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:646) 
Aug 21 17:25:41 kernel: el0t_64_sync (arch/arm64/kernel/entry.S:581) 
Aug 21 17:25:41 kernel: Disabling lock debugging due to kernel taint
Aug 21 17:25:41 kernel: ------------[ cut here ]------------
Aug 21 17:25:41 kernel: WARNING: CPU: 0 PID: 466 at mm/memory.c:2923 wp_page_copy (mm/memory.c:2923 mm/memory.c:3117) 
Aug 21 17:25:41 kernel: Modules linked in: rfkill bcm2835_unicam v4l2_dv_timings videobuf2_dma_contig rtc_pcf85063 videobuf2_memops ov9281 videobuf2_v4l2 regmap_i2c videobuf2_co>
Aug 21 17:25:41 kernel: CPU: 0 PID: 466 Comm: Xorg Tainted: G    B    C         6.0.0-rc1-v8-gc8f41281d1f4 #7
Aug 21 17:25:41 kernel: Hardware name: Raspberry Pi Compute Module 4 Rev 1.0 (DT)
Aug 21 17:25:41 kernel: pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
Aug 21 17:25:41 kernel: pc : wp_page_copy (mm/memory.c:2923 mm/memory.c:3117) 
Aug 21 17:25:41 kernel: lr : wp_page_copy (mm/memory.c:2917 mm/memory.c:3117) 
Aug 21 17:25:41 kernel: sp : ffffffc00be77a30
Aug 21 17:25:41 kernel: x29: ffffffc00be77a30 x28: ffffff805faec000 x27: ffffff804bc35e58
Aug 21 17:25:41 kernel: x26: ffffffc00be77cb0 x25: ffffff804a7ac680 x24: ffffffc00be77c88
Aug 21 17:25:41 kernel: x23: 1ffffff8017cef5e x22: ffffff804bc35e18 x21: 0000000000000000
Aug 21 17:25:41 kernel: x20: fffffffe017ebb00 x19: ffffffc00be77c70 x18: 0000000000000000
Aug 21 17:25:41 kernel: x17: 0000000000000000 x16: 0000000000000000 x15: 0000005599400000
Aug 21 17:25:41 kernel: x14: 0000000000000000 x13: 746e696174206c65 x12: ffffffb00bf5da00
Aug 21 17:25:41 kernel: x11: 1ffffff00bf5d9ff x10: ffffffb00bf5d9ff x9 : dfffffc000000000
Aug 21 17:25:41 kernel: x8 : ffffff805faecfff x7 : 0000000000000000 x6 : ffffff805faec000
Aug 21 17:25:41 kernel: x5 : ffffff805faed000 x4 : 0000000000000000 x3 : 0000005599400000
Aug 21 17:25:41 kernel: x2 : 0000000000000f80 x1 : 0000005599400000 x0 : 0000000000001000
Aug 21 17:25:41 kernel: Call trace:
Aug 21 17:25:41 kernel: wp_page_copy (mm/memory.c:2923 mm/memory.c:3117) 
Aug 21 17:25:41 kernel: do_wp_page (mm/memory.c:3404) 
Aug 21 17:25:41 kernel: __handle_mm_fault (mm/memory.c:4929 mm/memory.c:5053) 
Aug 21 17:25:41 kernel: handle_mm_fault (mm/memory.c:5151) 
Aug 21 17:25:41 kernel: do_page_fault (arch/arm64/mm/fault.c:502 arch/arm64/mm/fault.c:602) 
Aug 21 17:25:41 kernel: do_translation_fault (arch/arm64/mm/fault.c:689) 
Aug 21 17:25:41 kernel: do_mem_abort (arch/arm64/mm/fault.c:818) 
Aug 21 17:25:41 kernel: el0_da (arch/arm64/kernel/entry-common.c:133 arch/arm64/kernel/entry-common.c:142 arch/arm64/kernel/entry-common.c:503) 
Aug 21 17:25:41 kernel: el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:646) 
Aug 21 17:25:41 kernel: el0t_64_sync (arch/arm64/kernel/entry.S:581) 
Aug 21 17:25:41 kernel: ---[ end trace 0000000000000000 ]---
Aug 21 17:25:47 systemd[1]: systemd-fsckd.service: Succeeded.

Steps to reproduce the behaviour

I have the suspicion, that this is more easily triggered when memory is tight, so

  1. disable swap
$ free
               total        used        free      shared  buff/cache   available
Mem:         1615044      417180      355252       37360      842612     1092072
Swap:              0           0           0
  1. add dtoverlay dtoverlay=ov9281, remove kernel security/memory address "fog",
cmdline.txt
-----------
coherent_pool=1M snd_bcm2835.enable_headphones=0 snd_bcm2835.enable_hdmi=1 snd_bcm2835.enable_hdmi=0 video=HDMI-A-1:1920x1080M@60 vc_mem.mem_base=0x3ec00000 vc_mem.mem_size=0x40000000  console=tty1 root=PARTUUID=18bffb3d-02 rootfstype=ext4 fsck.repair=yes rootwait nokaslr nokpti mitigations=off
  1. compile the attached reproducer gcc -o rpidmareproducer rpidmareproducer.c
    rpidmareproducer.c.txt

NB: I have blacklisted, rpivid_hevc, bcm2835_codec, bcm2835-isp, bcm2835_v4l2, bcm2835_mmal_vchiq thus, I do not suspect the issue to be there.

I have found to be able to best reproduce when using boot loops, so I autostart the reproducer ( ".config/autostart/repro.desktop")

[Desktop Entry]
Name=Autostart
Type=Application
Exec=/path/to/rpidmareproducer
Terminal=true

The reproducer works with 1280x720 as image format, this can also be supplied by the vivid driver. Execute on every reboot to set the camera correctly:

#!/bin/bash

while [[ ! -e /dev/video0 && ! -r /dev/video ]] ; do
  sleep 2;
done

DEV="$(v4l2-ctl --list-devices | grep -A 2 fe800000 | grep media | tr -d '[:space:]')"
media-ctl -d $DEV --set-v4l2 '1:0[fmt:Y8_1X8/1280x720 field:none]' -v

# 25 fps
v4l2-ctl -d /dev/v4l-subdev0 --set-ctrl vertical_blanking=4474

What the reproducer does:
It starts streaming from the camera and sometimes executes a syscall . It seems that this is when the system tries to copy pages and fails.

  1. ATTENTION: It will shutdown and reboot your system. But your traces will be logged in journalctl. ( alternative: use watch -n 15 shutdown -c to cancel all occurring reboot requests.)

My kernels are compiled from the rpi repo with additional debug options, e.g. KASAN. To use the vivid driver, enable Kernel Option CONFIG_VIDEO_VIVID=m.

If you want to see the reproducer run for ours without issue, remove the dtoverlay, and modprobe vivid.

Device (s)

Raspberry Pi 4 Mod. B, Raspberry Pi CM4 Lite

System

Raspberry Pi reference 2022-04-04
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 27a8050c3c06e567c794620394a8c2d74262a516, stage2

Aug 9 2022 13:44:40
Copyright (c) 2012 Broadcom
version 273b410636cf8854ca35af91fd738a3d5f8b39b6 (clean) (release) (start)

5.10.110-v8-g2d80ef99861c SMP PREEMPT Fri Aug 19 09:08:48 UTC 2022 aarch64 GNU/Linux
6.0.0-rc1-v8-gc8f41281d1f4

More info in raspinfo.txt

Logs

These redzone-overwritten messages hint to something in the memory code being wrong:

Left Redzone, CM4 , 6.0.0-rc1


[17:28:20] =============================================================================
[17:28:20] BUG kmalloc-cg-512 (Tainted: G         C        ): Left Redzone overwritten
[17:28:20] -----------------------------------------------------------------------------

[17:28:20] 0xffffff8004c50000-0xffffff8004c500ff @offset=0. First byte 0x9 instead of 0xcc
[17:28:20] Allocated in __alloc_skb+0xd8/0x1fc age=6448 cpu=1 pid=1058 
[17:28:20] __slab_alloc.constprop.0 (mm/slub.c:3118) 
[17:28:20] __kmalloc_track_caller (mm/slub.c:3209 mm/slub.c:3251 mm/slub.c:4924) 
[17:28:20] kmalloc_reserve.constprop.0 (net/core/skbuff.c:358) 
[17:28:20] __alloc_skb (net/core/skbuff.c:430) 
[17:28:20] alloc_skb_with_frags (./include/linux/skbuff.h:1257 net/core/skbuff.c:6018) 
[17:28:20] sock_alloc_send_pskb (net/core/sock.c:2663) 
[17:28:20] unix_stream_sendmsg (net/unix/af_unix.c:2168) 
[17:28:20] sock_sendmsg (net/socket.c:717 net/socket.c:734) 
[17:28:20] sock_write_iter (net/socket.c:1108) 
[17:28:20] do_iter_readv_writev (./include/linux/fs.h:2187 fs/read_write.c:729) 
[17:28:20] do_iter_write (fs/read_write.c:855) 
[17:28:20] vfs_writev (fs/read_write.c:928) 
[17:28:20] do_writev (fs/read_write.c:971) 
[17:28:20] __arm64_sys_writev (fs/read_write.c:1041) 
[17:28:20] invoke_syscall (arch/arm64/kernel/syscall.c:38 arch/arm64/kernel/syscall.c:52) 
[17:28:20] el0_svc_common.constprop.0 (./arch/arm64/include/asm/daifflags.h:28 arch/arm64/kernel/syscall.c:150) 
[17:28:20] Slab 0xfffffffe00131400 objects=21 used=8 fp=0xffffff8004c50800 flags=0x10201(locked|slab|head|zone=0)
[17:28:20] Object 0xffffff8004c50200 @offset=512 fp=0xffffff8004c50800

[17:28:20] Redzone  ffffff8004c50000: 09 05 0b 08 07 09 09 0a 07 09 07 0b 07 07 0a 05  ................
[17:28:20] Redzone  ffffff8004c50010: 09 07 08 07 09 0a 0a 07 06 08 09 08 08 07 0a 07  ................
[17:28:20] Redzone  ffffff8004c50020: 0c 06 08 0a 05 09 0b 05 08 07 0a 08 09 08 07 0c  ................
[17:28:20] Redzone  ffffff8004c50030: 07 08 06 09 09 08 06 08 07 0a 07 07 07 08 07 08  ................
[17:28:20] Redzone  ffffff8004c50040: 06 0a 09 09 06 0a 0a 08 0a 04 08 07 05 06 07 05  ................
[17:28:20] Redzone  ffffff8004c50050: 05 07 06 08 03 06 06 05 07 0a 06 07 08 01 07 05  ................
[17:28:20] Redzone  ffffff8004c50060: 06 09 05 07 06 07 06 05 03 07 09 05 09 04 09 07  ................
[17:28:20] Redzone  ffffff8004c50070: 05 07 07 07 06 06 0a 07 0a 06 04 06 03 07 07 07  ................
[17:28:20] Redzone  ffffff8004c50080: 05 05 07 08 05 05 04 05 04 06 04 08 05 07 09 07  ................
[17:28:20] Redzone  ffffff8004c50090: 08 05 08 04 04 08 08 04 08 07 07 09 04 05 04 08  ................
[17:28:20] Redzone  ffffff8004c500a0: 07 09 04 06 06 03 06 07 04 05 05 07 06 02 07 06  ................
[17:28:20] Redzone  ffffff8004c500b0: 03 04 07 05 06 03 06 08 03 07 03 04 08 03 04 06  ................
[17:28:20] Redzone  ffffff8004c500c0: 07 04 08 06 07 04 07 03 04 05 05 05 03 08 04 07  ................
[17:28:20] Redzone  ffffff8004c500d0: 04 09 06 06 02 03 05 04 04 09 06 08 07 07 05 04  ................
[17:28:20] Redzone  ffffff8004c500e0: 07 0a 05 06 04 05 04 02 05 07 06 06 06 06 05 08  ................
[17:28:20] Redzone  ffffff8004c500f0: 04 05 04 03 05 07 03 05 05 04 03 09 02 03 08 01  ................
[17:28:20] Redzone  ffffff8004c50100: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc  ................
[17:28:20] Redzone  ffffff8004c50110: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc  ................
[17:28:20] Redzone  ffffff8004c50120: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc  ................
[17:28:20] Redzone  ffffff8004c50130: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc  ................
[17:28:20] Redzone  ffffff8004c50140: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc  ................
[17:28:20] Redzone  ffffff8004c50150: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc  ................
[17:28:20] Redzone  ffffff8004c50160: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc  ................
[17:28:20] Redzone  ffffff8004c50170: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc  ................
[17:28:20] Redzone  ffffff8004c50180: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc  ................
[17:28:20] Redzone  ffffff8004c50190: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc  ................
[17:28:20] Redzone  ffffff8004c501a0: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc  ................
[17:28:20] Redzone  ffffff8004c501b0: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc  ................
[17:28:20] Redzone  ffffff8004c501c0: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc  ................
[17:28:20] Redzone  ffffff8004c501d0: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc  ................
[17:28:20] Redzone  ffffff8004c501e0: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc  ................
[17:28:20] Redzone  ffffff8004c501f0: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc  ................
[17:28:20] Object   ffffff8004c50200: c0 44 2e 63 80 ff ff ff e7 01 00 00 fc 00 e0 88  .D.c............
[17:28:20] Object   ffffff8004c50210: 08 00 00 00 00 00 00 00 10 00 0b 00 21 00 00 00  ............!...
[17:28:20] Object   ffffff8004c50220: 5f 47 54 4b 5f 48 49 44 45 5f 54 49 54 4c 45 42  _GTK_HIDE_TITLEB
[17:28:20] Object   ffffff8004c50230: 41 52 5f 57 48 45 4e 5f 4d 41 58 49 4d 49 5a 45  AR_WHEN_MAXIMIZE
[17:28:20] Object   ffffff8004c50240: 44 00 00 00 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  D...kkkkkkkkkkkk
[17:28:20] Object   ffffff8004c50250: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[17:28:20] Object   ffffff8004c50260: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[17:28:20] Object   ffffff8004c50270: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[17:28:20] Object   ffffff8004c50280: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[17:28:20] Object   ffffff8004c50290: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[17:28:20] Object   ffffff8004c502a0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[17:28:20] Object   ffffff8004c502b0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[17:28:20] Object   ffffff8004c502c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[17:28:20] Object   ffffff8004c502d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[17:28:20] Object   ffffff8004c502e0: 01 00 00 00 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  ....kkkkkkkkkkkk
[17:28:20] Object   ffffff8004c502f0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[17:28:20] Object   ffffff8004c50300: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[17:28:20] Object   ffffff8004c50310: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[17:28:20] Object   ffffff8004c50320: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[17:28:20] Object   ffffff8004c50330: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[17:28:20] Object   ffffff8004c50340: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[17:28:20] Object   ffffff8004c50350: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[17:28:20] Object   ffffff8004c50360: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[17:28:20] Object   ffffff8004c50370: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[17:28:20] Object   ffffff8004c50380: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[17:28:20] Object   ffffff8004c50390: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[17:28:20] Object   ffffff8004c503a0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[17:28:20] Object   ffffff8004c503b0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[17:28:20] Object   ffffff8004c503c0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[17:28:20] Object   ffffff8004c503d0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[17:28:20] Object   ffffff8004c503e0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[17:28:20] Object   ffffff8004c503f0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5  kkkkkkkkkkkkkkk.
[17:28:20] Redzone  ffffff8004c50400: cc cc cc cc cc cc cc cc                          ........
[17:28:20] Padding  ffffff8004c50460: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[17:28:20] Padding  ffffff8004c50470: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[17:28:20] Padding  ffffff8004c50480: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[17:28:20] Padding  ffffff8004c50490: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[17:28:20] Padding  ffffff8004c504a0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[17:28:20] Padding  ffffff8004c504b0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[17:28:20] Padding  ffffff8004c504c0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[17:28:20] Padding  ffffff8004c504d0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[17:28:20] Padding  ffffff8004c504e0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[17:28:20] Padding  ffffff8004c504f0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[17:28:20] Padding  ffffff8004c50500: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[17:28:20] Padding  ffffff8004c50510: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[17:28:20] Padding  ffffff8004c50520: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[17:28:20] Padding  ffffff8004c50530: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[17:28:20] Padding  ffffff8004c50540: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[17:28:20] Padding  ffffff8004c50550: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[17:28:20] Padding  ffffff8004c50560: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[17:28:20] Padding  ffffff8004c50570: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[17:28:20] Padding  ffffff8004c50580: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[17:28:20] Padding  ffffff8004c50590: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[17:28:20] Padding  ffffff8004c505a0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[17:28:20] Padding  ffffff8004c505b0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[17:28:20] Padding  ffffff8004c505c0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[17:28:20] Padding  ffffff8004c505d0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[17:28:20] Padding  ffffff8004c505e0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[17:28:20] Padding  ffffff8004c505f0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[17:28:20] CPU: 2 PID: 1170 Comm: geany Tainted: G         C         6.0.0-rc1-v8-gc8f41281d1f4 #7
[17:28:20] Hardware name: Raspberry Pi Compute Module 4 Rev 1.0 (DT)
[17:28:20] Call trace:
[17:28:20] dump_backtrace.part.0 (arch/arm64/kernel/stacktrace.c:184) 
[17:28:20] show_stack (arch/arm64/kernel/stacktrace.c:191) 
[17:28:20] dump_stack_lvl (lib/dump_stack.c:107 (discriminator 4)) 
[17:28:20] dump_stack (lib/dump_stack.c:114) 
[17:28:20] print_trailer (mm/slub.c:886) 
[17:28:20] check_bytes_and_report (mm/slub.c:980) 
[17:28:20] check_object (mm/slub.c:1085) 
[17:28:20] free_debug_processing (mm/slub.c:1369 mm/slub.c:1416) 
[17:28:20] __slab_free (mm/slub.c:3344 (discriminator 1)) 
[17:28:20] ___cache_free (mm/slub.c:3542) 
[17:28:20] qlist_free_all (mm/kasan/quarantine.c:182) 
[17:28:20] kasan_quarantine_reduce (./include/linux/srcu.h:189 mm/kasan/quarantine.c:295) 
[17:28:20] __kasan_slab_alloc (mm/kasan/common.c:447) 
[17:28:20] kmem_cache_alloc (mm/slab.h:727 mm/slub.c:3243 mm/slub.c:3251 mm/slub.c:3258 mm/slub.c:3268) 
[17:28:20] getname_flags.part.0 (fs/namei.c:140) 
[17:28:20] getname_flags (fs/namei.c:205) 
[17:28:20] vfs_fstatat (fs/stat.c:255) 
[17:28:20] __do_sys_newfstatat (fs/stat.c:426) 
[17:28:20] __arm64_sys_newfstatat (fs/stat.c:419) 
[17:28:20] invoke_syscall (arch/arm64/kernel/syscall.c:38 arch/arm64/kernel/syscall.c:52) 
[17:28:20] el0_svc_common.constprop.0 (./arch/arm64/include/asm/daifflags.h:28 arch/arm64/kernel/syscall.c:150) 
[17:28:20] do_el0_svc (arch/arm64/kernel/syscall.c:207) 
[17:28:20] el0_svc (arch/arm64/kernel/entry-common.c:133 arch/arm64/kernel/entry-common.c:142 arch/arm64/kernel/entry-common.c:625) 
[17:28:20] el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:643) 
[17:28:20] el0t_64_sync (arch/arm64/kernel/entry.S:581) 
[17:28:20] Disabling lock debugging due to kernel taint
[17:28:20] FIX kmalloc-cg-512: Restoring Left Redzone 0xffffff8004c50000-0xffffff8004c500ff=0xcc
[17:28:20] FIX kmalloc-cg-512: Object at 0xffffff8004c50200 not freed

KASAN null-ptr-deref, CM4, 5.10.110

08:46:58  rpidmareproducer[1130]: now calling system()
08:46:58  systemd-logind[381]: Creating /run/nologin, blocking further logins...
08:46:59  kernel: get_swap_device: Bad swap file entry 00010000
08:46:59  kernel: ==================================================================
08:46:59  kernel: BUG: KASAN: null-ptr-deref in do_swap_page (mm/memory.c:3953) 
08:46:59  kernel: Read of size 8 at addr 0000000000000000 by task gdbus/1113
08:46:59  kernel:
08:46:59  kernel: CPU: 2 PID: 1113 Comm: gdbus Tainted: G         C        5.10.110-v8-g2d80ef99861c #3
08:46:59  kernel: Hardware name: Raspberry Pi Compute Module 4 Rev 1.0 (DT)
08:46:59  kernel: Call trace:
08:46:59  kernel: dump_backtrace (arch/arm64/kernel/stacktrace.c:184) 
08:46:59  kernel: show_stack (arch/arm64/kernel/stacktrace.c:191) 
08:46:59  kernel: dump_stack (lib/show_mem.c:37) 
08:46:59  kernel: kasan_report (mm/kasan/init.c:455) 
08:46:59  kernel: __asan_load8 (mm/kasan/generic.c:47 mm/kasan/generic.c:65 mm/kasan/generic.c:151 mm/kasan/generic.c:180 mm/kasan/generic.c:256) 
08:46:59  kernel: do_swap_page (mm/memory.c:3953) 
08:46:59  kernel: handle_mm_fault (mm/memory.c:5493) 
08:46:59  kernel: do_page_fault (arch/arm64/mm/fault.c:612) 
08:46:59  kernel: do_translation_fault (arch/arm64/mm/fault.c:685) 
08:46:59  kernel: do_mem_abort (arch/arm64/mm/fault.c:818) 
08:46:59  kernel: el0_da (arch/arm64/kernel/entry-common.c:502) 
08:46:59  kernel: el0_sync_handler+0x168/0x1b0 
08:46:59  kernel: el0_sync+0x180/0x1c0 
08:46:59  kernel: ==================================================================
08:46:59  kernel: Disabling lock debugging due to kernel taint
08:46:59  kernel: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
08:46:59  kernel: Mem abort info:
08:46:59  kernel:   ESR = 0x96000005
08:46:59  kernel:   EC = 0x25: DABT (current EL), IL = 32 bits
08:46:59  kernel:   SET = 0, FnV = 0
08:46:59  kernel:   EA = 0, S1PTW = 0
08:46:59  kernel: Data abort info:
08:46:59  kernel:   ISV = 0, ISS = 0x00000005
08:46:59  kernel:   CM = 0, WnR = 0
08:46:59  kernel: user pgtable: 4k pages, 39-bit VAs, pgdp=000000006b00f000
08:46:59  kernel: [0000000000000000] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
08:46:59  kernel: Internal error: Oops: 96000005 [#1] PREEMPT SMP
08:46:59  kernel: Modules linked in: rfkill rtc_pcf85063 regmap_i2c ov9281 snd_soc_hdmi_codec bcm2835_unicam vc4 videobuf2_dma_contig v4l2_dv_timings v3d v4l2_fwnode videobuf2_memops videobuf2_v4l2 videobuf2_common cec gpu_sched i2c_mux_pinctrl i2c_mux gpio_keys videodev drm_kms_helper _hwmon i2c_brcmstb mc snd_soc_core snd_bcm2835(C) snd_compress snd_pcm_dmaengine i2c_bcm2835 snd_pcm snd_timer vc_sm_cma(C) snd syscopyarea sysfillrect sysimgblt fb_sys_fops uio_pdrv_genirq uio nvmem_rmem drm fuse drm_panel_orientation_quirks backlight ip_tables x_tables ipv6
08:46:59  kernel: CPU: 2 PID: 1113 Comm: gdbus Tainted: G    B    C        5.10.110-v8-g2d80ef99861c #3
08:46:59  kernel: Hardware name: Raspberry Pi Compute Module 4 Rev 1.0 (DT)
08:46:59  kernel: pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
08:46:59  kernel: pc : do_swap_page (mm/memory.c:3953) 
08:46:59  kernel: lr : do_swap_page (mm/memory.c:3953) 
08:46:59  kernel: sp : ffffff8001cdfbc0
08:46:59  kernel: x29: ffffff8001cdfbc0 x28: 000000000000000c
08:46:59  kernel: x27: ffffff8001cdfd08 x26: ffffffd0115f2a20
08:46:59  kernel: x25: 0000000000000c73 x24: ffffff8064fe0a68
08:46:59  kernel: x23: ffffff8001cdfd20 x22: ffffff804a29d4f8
08:46:59  kernel: x21: 0000000000010000 x20: ffffff8001cdfcf0
08:46:59  kernel: x19: 0000000000000000 x18: 0000000000000000
08:46:59  kernel: x17: 0000000000000000 x16: 0000000000000000
08:46:59  kernel: x15: 0000000000000000 x14: 0000000000000000
08:46:59  kernel: x13: 0000000000000000 x12: ffffffc00039bf49
08:46:59  kernel: x11: 1ffffff00039bf48 x10: ffffffc00039bf48
08:46:59  kernel: x9 : ffffffd011030508 x8 : ffffff8001cdfa47
08:46:59  kernel: x7 : 0000000000000001 x6 : 0000003fffc640b8
08:46:59  kernel: x5 : ffffff8001cdfa40 x4 : dfffffd000000000
08:46:59  kernel: x3 : ffffffd011030514 x2 : 0000000000000007
08:46:59  kernel: x1 : ffffff8064fe0040 x0 : 0000000000000001
08:46:59  kernel: Call trace:
08:46:59  kernel: do_swap_page (mm/memory.c:3953) 
08:46:59  kernel: handle_mm_fault (mm/memory.c:5493) 
08:47:00  kernel: do_page_fault (arch/arm64/mm/fault.c:612) 
08:47:00  kernel: do_translation_fault (arch/arm64/mm/fault.c:685) 
08:47:00  kernel: do_mem_abort (arch/arm64/mm/fault.c:818) 
08:47:00  kernel: el0_da (arch/arm64/kernel/entry-common.c:502) 
08:47:00  kernel: el0_sync_handler+0x168/0x1b0 
08:47:00  kernel: el0_sync+0x180/0x1c0 
08:47:00  kernel: Code: aa1503e0 94010527 aa0003f3 940156de (f9400260)
All code
========
  0:	aa1503e0 	mov	x0, x21
  4:	94010527 	bl	0x414a0
  8:	aa0003f3 	mov	x19, x0
  c:	940156de 	bl	0x55b84
 10:*	f9400260 	ldr	x0, [x19]		<-- trapping instruction

Code starting with the faulting instruction
===========================================
  0:	f9400260 	ldr	x0, [x19]
08:47:00  kernel: ---[ end trace f862ea3cea94e4b5 ]---

Bad page map, RPI4 B, 6.0.0-rc1

18:49:47 rpidmareproducer[1021]: now calling system()
18:49:56 rpidmareproducer[1021]: done...closing
18:49:56 kernel: _swap_info_get: Bad swap file entry 40207070a050609
18:49:56 kernel: BUG: Bad page map in process rpidmareproduce  pte:a07070a0506090a pmd:800000001801003
18:49:56 kernel: addr:0000007f83000000 vm_flags:00100073 anon_vma:ffffff80601af008 mapping:0000000000000000 index:7f83000
18:49:56 kernel: file:(null) fault:0x0 mmap:0x0 read_folio:0x0
18:49:56 kernel: CPU: 0 PID: 1021 Comm: rpidmareproduce Tainted: G         C         6.0.0-rc1-v8-gc8f41281d1f4 #7
18:49:56 kernel: Hardware name: Raspberry Pi Compute Module 4 Rev 1.0 (DT)
18:49:56 kernel: Call trace:
18:49:56 kernel: dump_backtrace.part.0 (arch/arm64/kernel/stacktrace.c:184) 
18:49:56 kernel: show_stack (arch/arm64/kernel/stacktrace.c:191) 
18:49:56 kernel: dump_stack_lvl (lib/dump_stack.c:107 (discriminator 4)) 
18:49:56 kernel: dump_stack (lib/dump_stack.c:114) 
18:49:56 kernel: print_bad_pte (mm/memory.c:567 (discriminator 12)) 
18:49:56 kernel: unmap_page_range (mm/memory.c:1488 mm/memory.c:1575 mm/memory.c:1604 mm/memory.c:1625 mm/memory.c:1646) 
18:49:56 kernel: unmap_vmas (mm/memory.c:1730) 
18:49:56 kernel: exit_mmap (mm/mmap.c:3114) 
18:49:56 kernel: __mmput (kernel/fork.c:1250 kernel/fork.c:1189) 
18:49:56 kernel: mmput (kernel/fork.c:1209) 
18:49:56 kernel: do_exit (./arch/arm64/include/asm/current.h:19 kernel/exit.c:511 kernel/exit.c:782) 
18:49:56 kernel: do_group_exit (kernel/exit.c:906) 
18:49:56 kernel: __arm64_sys_exit_group (kernel/exit.c:936) 
18:49:56 kernel: invoke_syscall (arch/arm64/kernel/syscall.c:38 arch/arm64/kernel/syscall.c:52) 
18:49:56 kernel: el0_svc_common.constprop.0 (./arch/arm64/include/asm/daifflags.h:28 arch/arm64/kernel/syscall.c:150) 
18:49:56 kernel: do_el0_svc (arch/arm64/kernel/syscall.c:207) 
18:49:56 kernel: el0_svc (arch/arm64/kernel/entry-common.c:133 arch/arm64/kernel/entry-common.c:142 arch/arm64/kernel/entry-common.c:625) 
18:49:56 kernel: el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:643) 
18:49:56 kernel: el0t_64_sync (arch/arm64/kernel/entry.S:581) 
18:49:56 kernel: Disabling lock debugging due to kernel taint
18:49:56 kernel: _swap_info_get: Bad swap file entry 40207070a070c06

RPI 4B, 6.0.0-rc1, page allocation failure

18:20:47 kernel: a.out: page allocation failure: order:8, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
18:20:47 kernel: CPU: 1 PID: 2308 Comm: a.out Tainted: G    B    C         6.0.0-rc1-v8-gc8f41281d1f4 #7
18:20:47 kernel: Hardware name: Raspberry Pi Compute Module 4 Rev 1.0 (DT)
18:20:47 kernel: Call trace:
18:20:47 kernel: dump_backtrace.part.0 (arch/arm64/kernel/stacktrace.c:184) 
18:20:47 kernel: show_stack (arch/arm64/kernel/stacktrace.c:191) 
18:20:47 kernel: dump_stack_lvl (lib/dump_stack.c:107 (discriminator 4)) 
18:20:47 kernel: dump_stack (lib/dump_stack.c:114) 
18:20:47 kernel: warn_alloc (mm/page_alloc.c:4379) 
18:20:47 kernel: __alloc_pages (mm/page_alloc.c:5279 mm/page_alloc.c:5550) 
18:20:47 kernel: __dma_direct_alloc_pages.constprop.0 (./include/linux/gfp.h:243 ./include/linux/gfp.h:266 kernel/dma/direct.c:141) 
18:20:47 kernel: dma_direct_alloc (kernel/dma/direct.c:268) 
18:20:47 kernel: dma_alloc_attrs (kernel/dma/mapping.c:511) 
18:20:47 kernel: vb2_dc_alloc (drivers/media/common/videobuf2/videobuf2-dma-contig.c:195 drivers/media/common/videobuf2/videobuf2-dma-contig.c:257) videobuf2_dma_contig
18:20:47 kernel: __vb2_queue_alloc (drivers/media/common/videobuf2/videobuf2-core.c:233 drivers/media/common/videobuf2/videobuf2-core.c:444) videobuf2_common
18:20:47 kernel: vb2_core_reqbufs (drivers/media/common/videobuf2/videobuf2-core.c:838) videobuf2_common
18:20:47 kernel: vb2_ioctl_reqbufs (drivers/media/common/videobuf2/videobuf2-v4l2.c:1008) videobuf2_v4l2
18:20:47 kernel: v4l_reqbufs (drivers/media/v4l2-core/v4l2-ioctl.c:2097) videodev
18:20:47 kernel: __video_do_ioctl (drivers/media/v4l2-core/v4l2-ioctl.c:3043) videodev
18:20:47 kernel: video_usercopy (drivers/media/v4l2-core/v4l2-ioctl.c:3392) videodev
18:20:47 kernel: video_ioctl2 (drivers/media/v4l2-core/v4l2-ioctl.c:3440) videodev
18:20:47 kernel: v4l2_ioctl (drivers/media/v4l2-core/v4l2-dev.c:368) videodev
18:20:47 kernel: __arm64_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:870 fs/ioctl.c:856 fs/ioctl.c:856) 
18:20:47 kernel: invoke_syscall (arch/arm64/kernel/syscall.c:38 arch/arm64/kernel/syscall.c:52) 
18:20:47 kernel: el0_svc_common.constprop.0 (./arch/arm64/include/asm/daifflags.h:28 arch/arm64/kernel/syscall.c:150) 
18:20:47 kernel: do_el0_svc (arch/arm64/kernel/syscall.c:207) 
18:20:47 kernel: el0_svc (arch/arm64/kernel/entry-common.c:133 arch/arm64/kernel/entry-common.c:142 arch/arm64/kernel/entry-common.c:625) 
18:20:47 kernel: el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:643) 
18:20:47 kernel: el0t_64_sync (arch/arm64/kernel/entry.S:581) 
18:20:47 kernel: Mem-Info:
18:20:47 kernel: active_anon:298 inactive_anon:251827 isolated_anon:0
18:20:47 kernel: Node 0 active_anon:1192kB inactive_anon:1007308kB active_file:181180kB inactive_file:143764kB unevictable:80kB isolated(anon):0kB isolated(file):128kB mapped:233440kB dirty:3788kB writeback:0kB shmem:31372kB writeback_tmp:0kB kernel_stack:14624kB pagetables:11512kB all_unreclaimable? no
18:20:47 kernel: DMA free:19708kB boost:4096kB min:12972kB low:15188kB high:17404kB reserved_highatomic:0KB active_anon:364kB inactive_anon:645092kB active_file:37920kB inactive_file:87732kB unevictable:0kB writepending:2540kB present:970752kB managed:876496kB mlocked:0kB bounce:0kB free_pcp:184kB local_pcp:0kB free_cma:2460kB
18:20:47 kernel: lowmem_reserve[]: 0 724 724 724
18:20:47 kernel: DMA32 free:32496kB boost:16884kB min:24388kB low:26264kB high:28140kB reserved_highatomic:4096KB active_anon:828kB inactive_anon:362216kB active_file:142688kB inactive_file:55984kB unevictable:80kB writepending:1248kB present:1048576kB managed:741396kB mlocked:80kB bounce:0kB free_pcp:248kB local_pcp:0kB free_cma:0kB
18:20:47 kernel: lowmem_reserve[]: 0 0 0 0
18:20:47 kernel: DMA: 199*4kB (UMEC) 177*8kB (UMEC) 178*16kB (UMEC) 241*32kB (UMEC) 86*64kB (UMEC) 7*128kB (UEC) 2*256kB (C) 1*512kB (C) 0*1024kB 0*2048kB 0*4096kB = 20196kB
18:20:47 kernel: DMA32: 245*4kB (UMEH) 149*8kB (UMH) 149*16kB (UMH) 665*32kB (UMEH) 113*64kB (UE) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 33068kB
18:20:47 kernel: 89083 total pagecache pages
18:20:47 kernel: 0 pages in swap cache
18:20:47 kernel: Free swap  = 0kB
18:20:47 kernel: Total swap = 0kB
18:20:47 kernel: 504832 pages RAM
18:20:47 kernel: 0 pages HighMem/MovableOnly
18:20:47 kernel: 100359 pages reserved
18:20:47 kernel: 81920 pages cma reserved
18:20:47 kernel: unicam fe801000.csi: dma alloc of size 921600 failed

Additional context

+cc @naushir @pelwell

@pelwell
Copy link
Contributor

pelwell commented Aug 22, 2022

Interesting - could this be due to the same underlying issue as #5133?

@olokos
Copy link

olokos commented Aug 22, 2022

I've read this entire issue and the logs, and I couldn't find anything that relates to #5133
The tainted modules are different, the stack traces are completely different, the kernel is also different (5.15 for me, 5.10 and 6 here), I also don't use CM4, just Pi + microSD + IMX477/HQ Camera.

I'm not entirely sure how the kernel structure is built for Pi, so don't take my words as something 100% correct, I could be wrong here, but this doesn't seem related to me, considering that my logs were twice almost exactly the same, while here the logs structure itself is vastly different.

I could be wrong though and those 2 issues could be related, I just couldn't find any common error messages with my issue.

Moreover, steps to reproduce this issue are to disable swap.

I'm using 2GB swap for months now, far before this or my issue happened.
I'm also almost never actually using the swap, since my workload isn't at all RAM constrained, but CPU-Intensive

image

@pelwell
Copy link
Contributor

pelwell commented Aug 22, 2022

The common factors are "camera" and "memory corruption".

@olokos
Copy link

olokos commented Aug 22, 2022

Right, this could be related, as I said, don't take my word for it, I just wanted to show my perspective, but those 2 issues could be directly related aswell, I have also updated my issue with additional data.

@kralo
Copy link
Contributor Author

kralo commented Aug 23, 2022

I set up a third device, CM4, Kernel 5.19.1, reproducer does not take long.

Can I do something to produce more meaningful debug traces?

A 23 11:30:45 rpidmareproducer[911]: now calling system()
A 23 11:30:54 rpidmareproducer[911]: done...closing
A 23 11:30:54 : BUG: Bad page map in process rpidmareproduce pte:f5ffedfefbfefcf1 pmd:800000001c01003
A 23 11:30:54 : addr:0000007f9fa01000 vm_flags:00100073 anon_vma:ffffff8059578a98 mapping:0000000000000000 index:7f9fa01
A 23 11:30:54 : [file:(null)](file:///(null)) fault:0x0 mmap:0x0 read_folio:0x0
A 23 11:30:54 : CPU: 0 PID: 911 Comm: rpidmareproduce Tainted: G C        5.19.1-v8-ge7b8b60fb78b #10 8451487651094e40323d5adcf3324a1f1ff7f9cc
A 23 11:30:54 : Hardware name: Raspberry Pi Compute Module 4 Rev 1.0 (DT)
A 23 11:30:54 : Call trace:
A 23 11:30:54 : dump_backtrace.part.0 (arch/arm64//stacktrace.c:200)
A 23 11:30:54 : show_stack (arch/arm64//stacktrace.c:207)
A 23 11:30:54 : dump_stack_lvl (lib/dump_stack.c:107 (discriminator 4))
A 23 11:30:54 : dump_stack (lib/dump_stack.c:114)
A 23 11:30:54 : print_bad_pte (mm/memory.c:567 (discriminator 12))
A 23 11:30:54 : vm_normal_page (mm/memory.c:630)
A 23 11:30:54 : unmap_page_range (mm/memory.c:1425 mm/memory.c:1567 mm/memory.c:1596 mm/memory.c:1617 mm/memory.c:1638)
A 23 11:30:54 : unmap_vmas (mm/memory.c:1722)
A 23 11:30:54 : exit_mmap (mm/mmap.c:3163)
A 23 11:30:54 : __mmput (/fork.c:1250 (discriminator 2) /fork.c:1189 (discriminator 2))
A 23 11:30:54 : mmput (/fork.c:1209)
A 23 11:30:54 : do_exit (./arch/arm64/include/asm/current.h:19 /exit.c:511 /exit.c:782)
A 23 11:30:54 : do_group_exit (/exit.c:906)
A 23 11:30:54 : __arm64_sys_exit_group (/exit.c:936)
A 23 11:30:54 : invoke_syscall (arch/arm64//syscall.c:38 arch/arm64//syscall.c:52)
A 23 11:30:54 : el0_svc_common.constprop.0 (./arch/arm64/include/asm/daifflags.h:28 arch/arm64//syscall.c:150)
A 23 11:30:54 : do_el0_svc (arch/arm64//syscall.c:207)
A 23 11:30:54 : el0_svc (arch/arm64//entry-common.c:133 arch/arm64//entry-common.c:142 arch/arm64//entry-common.c:625)
A 23 11:30:54 : el0t_64_sync_handler (arch/arm64//entry-common.c:643)
A 23 11:30:54 : el0t_64_sync (arch/arm64//entry.S:581)
A 23 11:30:54 : Disabling lock debugging due to  taint
A 23 11:30:54 : BUG: Bad page map in process rpidmareproduce pte:61768fb1e0feffff pmd:800000001c01003
A 23 11:30:54 : addr:0000007f9fa14000 vm_flags:00100073 anon_vma:ffffff8059578a98 mapping:0000000000000000 index:7f9fa14
A 23 11:30:54 : [file:(null)](file:///(null)) fault:0x0 mmap:0x0 read_folio:0x0
A 23 11:30:54 : CPU: 0 PID: 911 Comm: rpidmareproduce Tainted: G    B C        5.19.1-v8-ge7b8b60fb78b #10 8451487651094e40323d5adcf3324a1f1ff7f9cc
A 23 11:30:54 : Hardware name: Raspberry Pi Compute Module 4 Rev 1.0 (DT)
A 23 11:30:54 : Call trace:
A 23 11:30:54 : dump_backtrace.part.0 (arch/arm64//stacktrace.c:200)
A 23 11:30:54 : show_stack (arch/arm64//stacktrace.c:207)
A 23 11:30:54 : dump_stack_lvl (lib/dump_stack.c:107 (discriminator 4))
A 23 11:30:54 : dump_stack (lib/dump_stack.c:114)
A 23 11:30:54 : print_bad_pte (mm/memory.c:567 (discriminator 12))
A 23 11:30:54 : vm_normal_page (mm/memory.c:630)
A 23 11:30:54 : unmap_page_range (mm/memory.c:1425 mm/memory.c:1567 mm/memory.c:1596 mm/memory.c:1617 mm/memory.c:1638)
A 23 11:30:54 : unmap_vmas (mm/memory.c:1722)
A 23 11:30:54 : exit_mmap (mm/mmap.c:3163)
A 23 11:30:54 : __mmput (/fork.c:1250 (discriminator 2) /fork.c:1189 (discriminator 2))
A 23 11:30:54 : mmput (/fork.c:1209)
A 23 11:30:54 : do_exit (./arch/arm64/include/asm/current.h:19 /exit.c:511 /exit.c:782)
A 23 11:30:54 : do_group_exit (/exit.c:906)
A 23 11:30:54 : __arm64_sys_exit_group (/exit.c:936)
A 23 11:30:54 : invoke_syscall (arch/arm64//syscall.c:38 arch/arm64//syscall.c:52)
A 23 11:30:54 : el0_svc_common.constprop.0 (./arch/arm64/include/asm/daifflags.h:28 arch/arm64//syscall.c:150)
A 23 11:30:54 : do_el0_svc (arch/arm64//syscall.c:207)
A 23 11:30:54 : el0_svc (arch/arm64//entry-common.c:133 arch/arm64//entry-common.c:142 arch/arm64//entry-common.c:625)
A 23 11:30:54 : el0t_64_sync_handler (arch/arm64//entry-common.c:643)
A 23 11:30:54 : el0t_64_sync (arch/arm64//entry.S:581)

@olokos
Copy link

olokos commented Aug 23, 2022

I think /var/log/messages.log is a very good source of logs.
It contains dmesg logs from multiple reboots, with archiving also, so you probably have a lot of those aswell.
Try to look into that file and find line that contains tainted, usually there will be a stack trace nearby.
I suggest you upload a cropped file that contains ~100 logs before and after the stack trace and the full file, then we'll know for sure if Your bug is related to my #5133 - speaking of which, just crashed again.

@kralo
Copy link
Contributor Author

kralo commented Aug 23, 2022

No, olokos. Spamming these issues with megabytes of logs won't get us anywhere. I am able to find stacktraces, and I can annotate the line numbers because I compile the kernel myself.

My issue has a working reproducer. What I was asking was if there was a hint to produce better logs by enabling more kernel debug options, i.e. I tried the cma_alloc ones, but they do not add any insight.

I will wait for someone from the kernel team to hear if they can reproduce.

@pelwell
Copy link
Contributor

pelwell commented Aug 23, 2022

We're still in holiday season, but this issue will get some attention in the not too distant future.

@olokos
Copy link

olokos commented Aug 24, 2022

@kralo I am not spamming issues with logs, on the contrary, I mostly included files directly in the comment, so there's no wall of logs as you open my issue. I also often use hastebin/pastebin in order to not clutter the issue page, there are many options out there! ;)

You asked for a hint to produce better logs, but you never provided logs what happened before, which I specifically asked for. That makes it much more difficult to compare your logs to mine, on top of that also Your log doesn't seem to be a typical DMESG output, which IMO makes it easier to see the time differences between calls.

We're using the same Pi, we're both connecting the camera directly to the Pi with flex tape, except that you use vivid and I use libcamera + libcamera-apps, but they most likely still use similar kernel functions to operate.
Also IMX477R vs ov9281 but the rest checks out.

Sometimes stack trace is a direct result of what has happened before, which you have chosen not to include, for reasons unknown to me.

The only reason why I'm even commenting here, is that Pelwell pointed out that this issue might be related to mine and all I want is for my camera stream to be very stable and reliable, otherwise I wouldn't write this comment.

@kralo
Copy link
Contributor Author

kralo commented Aug 25, 2022

I compiled the kernel with CONFIG_CMA_DEBUG=y and CONFIG_CMA_DEBUGFS=y. This led me to the bcm2835-unicam.c and
I enabled more verbose debug output with

$ cat /etc/modprobe.d/unicam.conf
# 3 is very verbose, 2 medium
options bcm2835_unicam debug=2

As my Application (and my reproducer, willfully) do, they "stall" the buffer pipeline (by sleep()'ing, because all the buffers get filled) and I get these messages
[ 104.206562] unicam fe801000.csi: ISR: [0] Dropping frame, buffer not available at FS

In bcm2835-unicam.c, there is this

/*
 * Size of the dummy buffer. Can be any size really, but the DMA
 * allocation works in units of page sizes.
 */
#define DUMMY_BUF_SIZE		(PAGE_SIZE)

This buffer, which on my RPi4CM is 4K, is where the CSI-Frame-Handler's write address gets directed to, when there is no (vb2-)buf ready.

But will the UNICAM_IBEA0 really respect the Endaddr and never write past it's boundary?

dma_addr_t endaddr = dmaaddr + buffer_size;

	if (pad_id == IMAGE_PAD) {
		reg_write(dev, UNICAM_IBSA0, dmaaddr);
		reg_write(dev, UNICAM_IBEA0, endaddr);

I could not find clear documentation about this. Someone with access to more verbose videocore iv specification should check this.
In the meantime my working theory is that the image sensor data got written to a buffer that was to small and spilled over.

As a quick test, I have increased the size of the dummy buffer, by #define DUMMY_BUF_SIZE (PAGE_SIZE*230) which is sufficient for my image size. My reproducer ran for 6 hours without problems, I stopped it to lack of patience but will run again this night.

Out-of-bounds write by DMA would be such a nice explanation...

@pelwell
Copy link
Contributor

pelwell commented Aug 26, 2022

That's a useful finding. It would be even better if you could initialise the start of the second page of the dummy buffer with a fixed pattern, and check periodically that it hasn't changed - something like this:

diff --git a/drivers/media/platform/bcm2835/bcm2835-unicam.c b/drivers/media/platform/bcm2835/bcm2835-unicam.c
index cad7f018b221d..a3c434e74900a 100644
--- a/drivers/media/platform/bcm2835/bcm2835-unicam.c
+++ b/drivers/media/platform/bcm2835/bcm2835-unicam.c
@@ -131,6 +131,11 @@ MODULE_PARM_DESC(media_controller, "Use media controller API");
  * allocation works in units of page sizes.
  */
 #define DUMMY_BUF_SIZE         (PAGE_SIZE)
+#define PADDED_DUMMY_BUF_SIZE  (PAGE_SIZE * 230)
+
+static const uint32_t guard_pattern[] = {
+       0x5a5a1234, 0xffffffff, 0xffffffff, 0x5678a5a5
+};
 
 enum pad_types {
        IMAGE_PAD,
@@ -852,6 +857,12 @@ static void unicam_schedule_dummy_buffer(struct unicam_node *node)
 static void unicam_process_buffer_complete(struct unicam_node *node,
                                           unsigned int sequence)
 {
+       if (memcmp(node->dummy_buf_cpu_addr + DUMMY_BUF_SIZE, guard_pattern, sizeof(guard_pattern)))
+       {
+               pr_err("%s: guard pattern corrupted:\n%*phN\n", __func__,
+                      sizeof(guard_pattern), node->dummy_buf_cpu_addr + DUMMY_BUF_SIZE);
+               memcpy(node->dummy_buf_cpu_addr + DUMMY_BUF_SIZE, guard_pattern, sizeof(guard_pattern));
+       }
        node->cur_frm->vb.field = node->m_fmt.field;
        node->cur_frm->vb.sequence = sequence;
 
@@ -3004,13 +3015,16 @@ static int register_node(struct unicam_device *unicam, struct unicam_node *node,
        media_entity_pads_init(&vdev->entity, 1, &node->pad);
 
        node->dummy_buf_cpu_addr = dma_alloc_coherent(&unicam->pdev->dev,
-                                                     DUMMY_BUF_SIZE,
+                                                     PADDED_DUMMY_BUF_SIZE,
                                                      &node->dummy_buf_dma_addr,
                                                      GFP_KERNEL);
        if (!node->dummy_buf_cpu_addr) {
                unicam_err(unicam, "Unable to allocate dummy buffer.\n");
                return -ENOMEM;
        }
+
+       memcpy(node->dummy_buf_cpu_addr + DUMMY_BUF_SIZE, guard_pattern, sizeof(guard_pattern));
+
        if (!unicam->mc_api) {
                if (pad_id == METADATA_PAD ||
                    !v4l2_subdev_has_op(unicam->sensor, video, s_std)) {
@@ -3104,7 +3118,7 @@ static void unregister_nodes(struct unicam_device *unicam)
                struct unicam_node *node = &unicam->node[i];
 
                if (node->dummy_buf_cpu_addr) {
-                       dma_free_coherent(&unicam->pdev->dev, DUMMY_BUF_SIZE,
+                       dma_free_coherent(&unicam->pdev->dev, PADDED_DUMMY_BUF_SIZE,
                                          node->dummy_buf_cpu_addr,
                                          node->dummy_buf_dma_addr);
                }

@pelwell
Copy link
Contributor

pelwell commented Aug 26, 2022

[ Patch updated - it was missing the DUMMY_BUF_SIZE offsets for the guard pattern. ]

@6by9
Copy link
Contributor

6by9 commented Aug 26, 2022

IIRC The buffer end address in Unicam was checked at the end of a line.
With IMX477 in full res mode (4056x3040 12bpp) one line would be longer than a single page. I'm initially surprised it's an issue with OV9281.

@naushir is out of the office until Tuesday, but we'll discuss it then.
In the firmware the end address would have been set to a multiple of the stride, therefore a comparison at the end of the line of the start address for the next line would be wrap correctly. With the kernel buffer being a page size, the start of the line would be within the buffer, but the end of the line wouldn't be.
Allocating a dummy buffer of size DUMMY_BUF_SIZE * 2, but passing DUMMY_BUF_SIZE to unicam_wr_dma_addr may be a solution.

It should be noted that the dummy buffer will only ever be written to in the event of a dropped frame due to buffers not being cycled correctly / fast enough.

@naushir
Copy link
Contributor

naushir commented Aug 30, 2022

Some of these call stacks look suspiciously similar to those in raspberrypi/rpicam-apps#246. For that issue there was a suspicion it was power related, and using over_voltage=xx helped resolve the crash.

But will the UNICAM_IBEA0 really respect the Endaddr and never write past it's boundary?

This should be the expected behaviour of the hardware, according to the spec. Clearly this may not be true based on your observations, so there may be some hardware bug that is in play. Note that the dummy buffer will be used when there are frame drops occurring. This is not entirely uncommon, so the overwrite does not seem to be happening on all occasions or we would see this much more often. Perhaps there's some other interaction when the system/memory bus is heavily loaded?

In the firmware the end address would have been set to a multiple of the stride, therefore a comparison at the end of the line of the start address for the next line would be wrap correctly. With the kernel buffer being a page size, the start of the line would be within the buffer, but the end of the line wouldn't be.

I would have hoped that the Unicam HW checks the buffer overwrite possibility on every AXI burst operation rather than just at the line boundary, but there is no way to confirm this. If that were the case, the image width would not matter for sizing the dummy buffer.

Allocating a dummy buffer of size DUMMY_BUF_SIZE * 2, but passing DUMMY_BUF_SIZE to unicam_wr_dma_addr may be a solution.

Assuming the HW has a bug, this feels more like the right solution here - maybe even size the buffer * 4 to be sure.

@naushir
Copy link
Contributor

naushir commented Aug 30, 2022

Predictably, I cannot reproduce the crash (in the 4-5 hours it has been running so far) with the rpidmareproducer script! However, I am running a very different configuration base:

  • 32-bit Raspberry Pi OS image.
  • imx477 camera.
  • Running on a Raspberry Pi 4B board, not a CM4.

I also have stress running in the background to make things busier just to see if that can provoke the crash. I'll switch to a 64-bit OS tomorrow if I still have no luck with my current image.

@naushir
Copy link
Contributor

naushir commented Aug 31, 2022

Right, so I definitely see the overruns in the dummy buffer occurring when adding/testing guard-words on the end of the buffer. I have not reproduced the overrun with libcamera, only with the provided rpidmareproducer.c test application.

The good news is that doubling the size of the dummy buffer and providing Unicam with half that size seems to "fix" the overrun - i.e. overruns still occur, but we do not trample outside of our allocation.

@kralo would you be able to try out this change from #5157 and let me know if you still get crashes with your setup?

@kralo
Copy link
Contributor Author

kralo commented Sep 1, 2022

I can reproduce the errors when checking with the guard patterns, this was an excellent idea.

However, to anyone trying this I recommend checking for (memcmp () != 0)

diff --git a/drivers/media/platform/bcm2835/bcm2835-unicam.c b/drivers/media/platform/bcm2835/bcm2835-unicam.c
index 979c4bf37..dd265e2e5 100644
--- a/drivers/media/platform/bcm2835/bcm2835-unicam.c
+++ b/drivers/media/platform/bcm2835/bcm2835-unicam.c
@@ -855,10 +855,18 @@ static void unicam_schedule_dummy_buffer(struct unicam_node *node)
 static void unicam_process_buffer_complete(struct unicam_node *node,
                                           unsigned int sequence)
 {
-       if (!memcmp(node->dummy_buf_cpu_addr + DUMMY_BUF_SIZE, guard_pattern, sizeof(guard_pattern)))
+       if (memcmp(node->dummy_buf_cpu_addr + DUMMY_BUF_SIZE, guard_pattern, sizeof(guard_pattern)) != 0)
        {
                pr_err("%s: guard pattern corrupted:\n%*phN\n", __func__,
                       sizeof(guard_pattern), node->dummy_buf_cpu_addr + DUMMY_BUF_SIZE);
+
+                         pr_err("%s: guard pattern is :\n%*phN\n", __func__,
+                      sizeof(guard_pattern), guard_pattern);
+
+                       pr_err(" guard memcmp result %d",
+                      memcmp(node->dummy_buf_cpu_addr + DUMMY_BUF_SIZE, guard_pattern, sizeof(guard_pattern)));
+
+
                memcpy(node->dummy_buf_cpu_addr + DUMMY_BUF_SIZE, guard_pattern, sizeof(guard_pattern));

        }

This yields

[  181.781915] unicam fe801000.csi: ISR: [0] Dropping frame, buffer not available at FS
[  181.793360] unicam_process_buffer_complete: guard pattern corrupted:
               ff00ff00ff00ff00ff00ff00ff00ff00
[  181.793435] unicam_process_buffer_complete: guard pattern is :
               34125a5affffffffffffffffa5a57856
[  181.793452]  guard memcmp result 203

which clearly is camera pixel data, as my ov9281 ist set to generate the testpattern, i.e. v4l2-ctl -d /dev/v4l-subdev0 --set-ctrl test_pattern=1

So I will now run the test for a couple hours and report in the pull req.

@pelwell
Copy link
Contributor

pelwell commented Sep 1, 2022

Thanks for the memcmp correction (the code wasn't even compiled) - I've updated the code fragment to avoid confusion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants