bpf: permit map_ptr arithmetic with opcode add and offset 0 #20
to bpf map fields") added support to access map fields
with CORE support. For example,

    struct bpf_map {
            __u32 max_entries;
    } __attribute__((preserve_access_index));

    struct bpf_array {
            struct bpf_map map;
            __u32 elem_size;
    } __attribute__((preserve_access_index));

    struct {
            __uint(type, BPF_MAP_TYPE_ARRAY);
            __uint(max_entries, 4);
            __type(key, __u32);
            __type(value, __u32);
    } m_array SEC(".maps");

    SEC("cgroup_skb/egress")
    int cg_skb(void *ctx)
    {
            struct bpf_array *array = (struct bpf_array *)&m_array;

            /* .. array->map.max_entries .. */
    }

In the kernel, bpf_htab has a similar structure,

    struct bpf_htab {
            struct bpf_map map;
            ...
    }

In the above cg_skb(), to access array->map.max_entries, with CORE, clang will
generate two builtins.

    base = &m_array;
    /* access array.map */
    map_addr = __builtin_preserve_struct_access_info(base, 0, 0);
    /* access array.map.max_entries */
    max_entries_addr = __builtin_preserve_struct_access_info(map_addr, 0, 0);
    max_entries = *max_entries_addr;

In the current llvm, if the two builtins are in the same function, or in the
same function after inlining, the compiler is smart enough to chain them
together and generates code like below:

    base = &m_array;
    max_entries = *(base + reloc_offset); /* reloc_offset = 0 in this case */

and we are fine.

But if we force no inlining for one of the functions in the test_map_ptr()
selftest, e.g., check_default(), the above two __builtin_preserve_* will be in
two different functions. In this case, we will have code like:

    func check_hash():
            reloc_offset_map = 0;
            base = &m_array;
            map_base = base + reloc_offset_map;
            check_default(map_base, ...)
    func check_default(map_base, ...):
            max_entries = *(map_base + reloc_offset_max_entries);

In the kernel, map_ptr (CONST_PTR_TO_MAP) does not allow any arithmetic.
The above "map_base = base + reloc_offset_map" will trigger a verifier failure.

    ; VERIFY(check_default(&hash->map, map));
    0: (18) r7 = 0xffffb4fe8018a004
    2: (b4) w1 = 110
    3: (63) *(u32 *)(r7 +0) = r1
     R1_w=invP110 R7_w=map_value(id=0,off=4,ks=4,vs=8,imm=0) R10=fp0
    ; VERIFY_TYPE(BPF_MAP_TYPE_HASH, check_hash);
    4: (18) r1 = 0xffffb4fe8018a000
    6: (b4) w2 = 1
    7: (63) *(u32 *)(r1 +0) = r2
     R1_w=map_value(id=0,off=0,ks=4,vs=8,imm=0) R2_w=invP1 R7_w=map_value(id=0,off=4,ks=4,vs=8,imm=0) R10=fp0
    8: (b7) r2 = 0
    9: (18) r8 = 0xffff90bcb500c000
    11: (18) r1 = 0xffff90bcb500c000
    13: (0f) r1 += r2
    R1 pointer arithmetic on map_ptr prohibited

To fix the issue, let us permit map_ptr + 0 arithmetic, which will
result in exactly the same map_ptr.

Signed-off-by: Yonghong Song <[email protected]>
---
 kernel/bpf/verifier.c | 4 ++++
 1 file changed, 4 insertions(+)
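The rule described in this commit message can be modelled in user space. The following is only a minimal sketch under stated assumptions, not the kernel's verifier code: `reg_type`, `alu_op`, `struct scalar` and `check_ptr_alu()` are hypothetical names invented for illustration, and the handling of other pointer types is left out.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for verifier state; not kernel types. */
enum reg_type { REG_SCALAR, REG_MAP_VALUE, REG_CONST_PTR_TO_MAP };
enum alu_op { OP_ADD, OP_SUB };

struct scalar {
	bool known;     /* true when an exact constant value is tracked */
	int64_t value;  /* meaningful only when known is true */
};

/*
 * Decide whether "ptr <op>= scalar" is acceptable for a pointer of the
 * given type. Returns 0 when allowed and -EACCES when rejected.
 */
int check_ptr_alu(enum reg_type ptr_type, enum alu_op op, struct scalar off)
{
	if (ptr_type == REG_CONST_PTR_TO_MAP) {
		/* map_ptr + 0 yields exactly the same map_ptr, so permit it. */
		if (op == OP_ADD && off.known && off.value == 0)
			return 0;
		/* Any other arithmetic on a map pointer stays prohibited. */
		return -EACCES;
	}
	/* Other pointer types have their own rules, omitted in this sketch. */
	return 0;
}
```

In this model, an add of a known zero is accepted, while a non-zero offset, an unknown scalar, or a subtraction is still rejected, which mirrors the "map_ptr + 0" wording above.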
one of the subtests, which will fail the test without the previous verifier
change. Also added verifier tests for both "map_ptr += scalar" and
"scalar += map_ptr" arithmetic.

Signed-off-by: Yonghong Song <[email protected]>
---
 .../selftests/bpf/progs/map_ptr_kern.c | 10 +++++-
 .../testing/selftests/bpf/verifier/map_ptr.c | 32 +++++++++++++++++++
 2 files changed, 41 insertions(+), 1 deletion(-)
Master branch: bc0b5a0 patch https://patchwork.ozlabs.org/project/netdev/patch/[email protected]/ applied successfully
At least one diff in series https://patchwork.ozlabs.org/project/netdev/list/?series=200277 irrelevant now. Closing PR.
The test_generic_metric() failed to release entries in the pctx. Asan
reported the following leak (and more):

  Direct leak of 128 byte(s) in 1 object(s) allocated from:
    #0 0x7f4c9396980e in calloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10780e)
    #1 0x55f7e748cc14 in hashmap_grow (/home/namhyung/project/linux/tools/perf/perf+0x90cc14)
    #2 0x55f7e748d497 in hashmap__insert (/home/namhyung/project/linux/tools/perf/perf+0x90d497)
    #3 0x55f7e7341667 in hashmap__set /home/namhyung/project/linux/tools/perf/util/hashmap.h:111
    #4 0x55f7e7341667 in expr__add_ref util/expr.c:120
    #5 0x55f7e7292436 in prepare_metric util/stat-shadow.c:783
    #6 0x55f7e729556d in test_generic_metric util/stat-shadow.c:858
    #7 0x55f7e712390b in compute_single tests/parse-metric.c:128
    #8 0x55f7e712390b in __compute_metric tests/parse-metric.c:180
    #9 0x55f7e712446d in compute_metric tests/parse-metric.c:196
    #10 0x55f7e712446d in test_dcache_l2 tests/parse-metric.c:295
    #11 0x55f7e712446d in test__parse_metric tests/parse-metric.c:355
    #12 0x55f7e70be09b in run_test tests/builtin-test.c:410
    #13 0x55f7e70be09b in test_and_print tests/builtin-test.c:440
    #14 0x55f7e70c101a in __cmd_test tests/builtin-test.c:661
    #15 0x55f7e70c101a in cmd_test tests/builtin-test.c:807
    #16 0x55f7e7126214 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312
    #17 0x55f7e6fc41a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364
    #18 0x55f7e6fc41a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408
    #19 0x55f7e6fc41a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538
    #20 0x7f4c93492cc9 in __libc_start_main ../csu/libc-start.c:308

Fixes: 6d432c4 ("perf tools: Add test_generic_metric function")
Signed-off-by: Namhyung Kim <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
This fix is for a failure that occurred in the DWARF unwind perf test. Stack unwinders may probe memory when looking for frames. Memory sanitizer will poison and track uninitialized memory on the stack, and on the heap if the value is copied to the heap. This can lead to false memory sanitizer failures for the use of an uninitialized value. Avoid this problem by removing the poison on the copied stack. The full msan failure with track origins looks like: ==2168==WARNING: MemorySanitizer: use-of-uninitialized-value #0 0x559ceb10755b in handle_cfi elfutils/libdwfl/frame_unwind.c:648:8 #1 0x559ceb105448 in __libdwfl_frame_unwind elfutils/libdwfl/frame_unwind.c:741:4 #2 0x559ceb0ece90 in dwfl_thread_getframes elfutils/libdwfl/dwfl_frame.c:435:7 #3 0x559ceb0ec6b7 in get_one_thread_frames_cb elfutils/libdwfl/dwfl_frame.c:379:10 #4 0x559ceb0ec6b7 in get_one_thread_cb elfutils/libdwfl/dwfl_frame.c:308:17 #5 0x559ceb0ec6b7 in dwfl_getthreads elfutils/libdwfl/dwfl_frame.c:283:17 #6 0x559ceb0ec6b7 in getthread elfutils/libdwfl/dwfl_frame.c:354:14 #7 0x559ceb0ec6b7 in dwfl_getthread_frames elfutils/libdwfl/dwfl_frame.c:388:10 #8 0x559ceaff6ae6 in unwind__get_entries tools/perf/util/unwind-libdw.c:236:8 #9 0x559ceabc9dbc in test_dwarf_unwind__thread tools/perf/tests/dwarf-unwind.c:111:8 #10 0x559ceabca5cf in test_dwarf_unwind__compare tools/perf/tests/dwarf-unwind.c:138:26 #11 0x7f812a6865b0 in bsearch (libc.so.6+0x4e5b0) #12 0x559ceabca871 in test_dwarf_unwind__krava_3 tools/perf/tests/dwarf-unwind.c:162:2 #13 0x559ceabca926 in test_dwarf_unwind__krava_2 tools/perf/tests/dwarf-unwind.c:169:9 #14 0x559ceabca946 in test_dwarf_unwind__krava_1 tools/perf/tests/dwarf-unwind.c:174:9 #15 0x559ceabcae12 in test__dwarf_unwind tools/perf/tests/dwarf-unwind.c:211:8 #16 0x559ceabbc4ab in run_test tools/perf/tests/builtin-test.c:418:9 #17 0x559ceabbc4ab in test_and_print tools/perf/tests/builtin-test.c:448:9 #18 0x559ceabbac70 in __cmd_test tools/perf/tests/builtin-test.c:669:4 #19 
0x559ceabbac70 in cmd_test tools/perf/tests/builtin-test.c:815:9 #20 0x559cea960e30 in run_builtin tools/perf/perf.c:313:11 #21 0x559cea95fbce in handle_internal_command tools/perf/perf.c:365:8 #22 0x559cea95fbce in run_argv tools/perf/perf.c:409:2 #23 0x559cea95fbce in main tools/perf/perf.c:539:3 Uninitialized value was stored to memory at #0 0x559ceb106acf in __libdwfl_frame_reg_set elfutils/libdwfl/frame_unwind.c:77:22 #1 0x559ceb106acf in handle_cfi elfutils/libdwfl/frame_unwind.c:627:13 #2 0x559ceb105448 in __libdwfl_frame_unwind elfutils/libdwfl/frame_unwind.c:741:4 #3 0x559ceb0ece90 in dwfl_thread_getframes elfutils/libdwfl/dwfl_frame.c:435:7 #4 0x559ceb0ec6b7 in get_one_thread_frames_cb elfutils/libdwfl/dwfl_frame.c:379:10 #5 0x559ceb0ec6b7 in get_one_thread_cb elfutils/libdwfl/dwfl_frame.c:308:17 #6 0x559ceb0ec6b7 in dwfl_getthreads elfutils/libdwfl/dwfl_frame.c:283:17 #7 0x559ceb0ec6b7 in getthread elfutils/libdwfl/dwfl_frame.c:354:14 #8 0x559ceb0ec6b7 in dwfl_getthread_frames elfutils/libdwfl/dwfl_frame.c:388:10 #9 0x559ceaff6ae6 in unwind__get_entries tools/perf/util/unwind-libdw.c:236:8 #10 0x559ceabc9dbc in test_dwarf_unwind__thread tools/perf/tests/dwarf-unwind.c:111:8 #11 0x559ceabca5cf in test_dwarf_unwind__compare tools/perf/tests/dwarf-unwind.c:138:26 #12 0x7f812a6865b0 in bsearch (libc.so.6+0x4e5b0) #13 0x559ceabca871 in test_dwarf_unwind__krava_3 tools/perf/tests/dwarf-unwind.c:162:2 #14 0x559ceabca926 in test_dwarf_unwind__krava_2 tools/perf/tests/dwarf-unwind.c:169:9 #15 0x559ceabca946 in test_dwarf_unwind__krava_1 tools/perf/tests/dwarf-unwind.c:174:9 #16 0x559ceabcae12 in test__dwarf_unwind tools/perf/tests/dwarf-unwind.c:211:8 #17 0x559ceabbc4ab in run_test tools/perf/tests/builtin-test.c:418:9 #18 0x559ceabbc4ab in test_and_print tools/perf/tests/builtin-test.c:448:9 #19 0x559ceabbac70 in __cmd_test tools/perf/tests/builtin-test.c:669:4 #20 0x559ceabbac70 in cmd_test tools/perf/tests/builtin-test.c:815:9 #21 0x559cea960e30 in run_builtin 
tools/perf/perf.c:313:11 #22 0x559cea95fbce in handle_internal_command tools/perf/perf.c:365:8 #23 0x559cea95fbce in run_argv tools/perf/perf.c:409:2 #24 0x559cea95fbce in main tools/perf/perf.c:539:3 Uninitialized value was stored to memory at #0 0x559ceb106a54 in handle_cfi elfutils/libdwfl/frame_unwind.c:613:9 #1 0x559ceb105448 in __libdwfl_frame_unwind elfutils/libdwfl/frame_unwind.c:741:4 #2 0x559ceb0ece90 in dwfl_thread_getframes elfutils/libdwfl/dwfl_frame.c:435:7 #3 0x559ceb0ec6b7 in get_one_thread_frames_cb elfutils/libdwfl/dwfl_frame.c:379:10 #4 0x559ceb0ec6b7 in get_one_thread_cb elfutils/libdwfl/dwfl_frame.c:308:17 #5 0x559ceb0ec6b7 in dwfl_getthreads elfutils/libdwfl/dwfl_frame.c:283:17 #6 0x559ceb0ec6b7 in getthread elfutils/libdwfl/dwfl_frame.c:354:14 #7 0x559ceb0ec6b7 in dwfl_getthread_frames elfutils/libdwfl/dwfl_frame.c:388:10 #8 0x559ceaff6ae6 in unwind__get_entries tools/perf/util/unwind-libdw.c:236:8 #9 0x559ceabc9dbc in test_dwarf_unwind__thread tools/perf/tests/dwarf-unwind.c:111:8 #10 0x559ceabca5cf in test_dwarf_unwind__compare tools/perf/tests/dwarf-unwind.c:138:26 #11 0x7f812a6865b0 in bsearch (libc.so.6+0x4e5b0) #12 0x559ceabca871 in test_dwarf_unwind__krava_3 tools/perf/tests/dwarf-unwind.c:162:2 #13 0x559ceabca926 in test_dwarf_unwind__krava_2 tools/perf/tests/dwarf-unwind.c:169:9 #14 0x559ceabca946 in test_dwarf_unwind__krava_1 tools/perf/tests/dwarf-unwind.c:174:9 #15 0x559ceabcae12 in test__dwarf_unwind tools/perf/tests/dwarf-unwind.c:211:8 #16 0x559ceabbc4ab in run_test tools/perf/tests/builtin-test.c:418:9 #17 0x559ceabbc4ab in test_and_print tools/perf/tests/builtin-test.c:448:9 #18 0x559ceabbac70 in __cmd_test tools/perf/tests/builtin-test.c:669:4 #19 0x559ceabbac70 in cmd_test tools/perf/tests/builtin-test.c:815:9 #20 0x559cea960e30 in run_builtin tools/perf/perf.c:313:11 #21 0x559cea95fbce in handle_internal_command tools/perf/perf.c:365:8 #22 0x559cea95fbce in run_argv tools/perf/perf.c:409:2 #23 0x559cea95fbce in main 
tools/perf/perf.c:539:3 Uninitialized value was stored to memory at #0 0x559ceaff8800 in memory_read tools/perf/util/unwind-libdw.c:156:10 #1 0x559ceb10f053 in expr_eval elfutils/libdwfl/frame_unwind.c:501:13 #2 0x559ceb1060cc in handle_cfi elfutils/libdwfl/frame_unwind.c:603:18 #3 0x559ceb105448 in __libdwfl_frame_unwind elfutils/libdwfl/frame_unwind.c:741:4 #4 0x559ceb0ece90 in dwfl_thread_getframes elfutils/libdwfl/dwfl_frame.c:435:7 #5 0x559ceb0ec6b7 in get_one_thread_frames_cb elfutils/libdwfl/dwfl_frame.c:379:10 #6 0x559ceb0ec6b7 in get_one_thread_cb elfutils/libdwfl/dwfl_frame.c:308:17 #7 0x559ceb0ec6b7 in dwfl_getthreads elfutils/libdwfl/dwfl_frame.c:283:17 #8 0x559ceb0ec6b7 in getthread elfutils/libdwfl/dwfl_frame.c:354:14 #9 0x559ceb0ec6b7 in dwfl_getthread_frames elfutils/libdwfl/dwfl_frame.c:388:10 #10 0x559ceaff6ae6 in unwind__get_entries tools/perf/util/unwind-libdw.c:236:8 #11 0x559ceabc9dbc in test_dwarf_unwind__thread tools/perf/tests/dwarf-unwind.c:111:8 #12 0x559ceabca5cf in test_dwarf_unwind__compare tools/perf/tests/dwarf-unwind.c:138:26 #13 0x7f812a6865b0 in bsearch (libc.so.6+0x4e5b0) #14 0x559ceabca871 in test_dwarf_unwind__krava_3 tools/perf/tests/dwarf-unwind.c:162:2 #15 0x559ceabca926 in test_dwarf_unwind__krava_2 tools/perf/tests/dwarf-unwind.c:169:9 #16 0x559ceabca946 in test_dwarf_unwind__krava_1 tools/perf/tests/dwarf-unwind.c:174:9 #17 0x559ceabcae12 in test__dwarf_unwind tools/perf/tests/dwarf-unwind.c:211:8 #18 0x559ceabbc4ab in run_test tools/perf/tests/builtin-test.c:418:9 #19 0x559ceabbc4ab in test_and_print tools/perf/tests/builtin-test.c:448:9 #20 0x559ceabbac70 in __cmd_test tools/perf/tests/builtin-test.c:669:4 #21 0x559ceabbac70 in cmd_test tools/perf/tests/builtin-test.c:815:9 #22 0x559cea960e30 in run_builtin tools/perf/perf.c:313:11 #23 0x559cea95fbce in handle_internal_command tools/perf/perf.c:365:8 #24 0x559cea95fbce in run_argv tools/perf/perf.c:409:2 #25 0x559cea95fbce in main tools/perf/perf.c:539:3 Uninitialized 
value was stored to memory at #0 0x559cea9027d9 in __msan_memcpy llvm/llvm-project/compiler-rt/lib/msan/msan_interceptors.cpp:1558:3 #1 0x559cea9d2185 in sample_ustack tools/perf/arch/x86/tests/dwarf-unwind.c:41:2 #2 0x559cea9d202c in test__arch_unwind_sample tools/perf/arch/x86/tests/dwarf-unwind.c:72:9 #3 0x559ceabc9cbd in test_dwarf_unwind__thread tools/perf/tests/dwarf-unwind.c:106:6 #4 0x559ceabca5cf in test_dwarf_unwind__compare tools/perf/tests/dwarf-unwind.c:138:26 #5 0x7f812a6865b0 in bsearch (libc.so.6+0x4e5b0) #6 0x559ceabca871 in test_dwarf_unwind__krava_3 tools/perf/tests/dwarf-unwind.c:162:2 #7 0x559ceabca926 in test_dwarf_unwind__krava_2 tools/perf/tests/dwarf-unwind.c:169:9 #8 0x559ceabca946 in test_dwarf_unwind__krava_1 tools/perf/tests/dwarf-unwind.c:174:9 #9 0x559ceabcae12 in test__dwarf_unwind tools/perf/tests/dwarf-unwind.c:211:8 #10 0x559ceabbc4ab in run_test tools/perf/tests/builtin-test.c:418:9 #11 0x559ceabbc4ab in test_and_print tools/perf/tests/builtin-test.c:448:9 #12 0x559ceabbac70 in __cmd_test tools/perf/tests/builtin-test.c:669:4 #13 0x559ceabbac70 in cmd_test tools/perf/tests/builtin-test.c:815:9 #14 0x559cea960e30 in run_builtin tools/perf/perf.c:313:11 #15 0x559cea95fbce in handle_internal_command tools/perf/perf.c:365:8 #16 0x559cea95fbce in run_argv tools/perf/perf.c:409:2 #17 0x559cea95fbce in main tools/perf/perf.c:539:3 Uninitialized value was created by an allocation of 'bf' in the stack frame of function 'perf_event__synthesize_mmap_events' #0 0x559ceafc5f60 in perf_event__synthesize_mmap_events tools/perf/util/synthetic-events.c:445 SUMMARY: MemorySanitizer: use-of-uninitialized-value elfutils/libdwfl/frame_unwind.c:648:8 in handle_cfi Signed-off-by: Ian Rogers <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: [email protected] Cc: Jiri Olsa <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sandeep Dasgupta <[email 
protected]> Cc: Stephane Eranian <[email protected]> Link: http://lore.kernel.org/lkml/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
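The idea of removing the poison from a copied stack can be sketched as below. This is an illustration under assumptions, not the patch itself: `copy_stack_snapshot()` and the `HAVE_MSAN` macro are hypothetical names, while `__msan_unpoison()` is the real interface from `<sanitizer/msan_interface.h>`. The guard makes the helper a plain `memcpy()` wrapper when MemorySanitizer is not enabled.

```c
#include <stddef.h>
#include <string.h>

/* Detect MemorySanitizer; builds without it take the plain-copy path. */
#if defined(__has_feature)
#  if __has_feature(memory_sanitizer)
#    include <sanitizer/msan_interface.h>
#    define HAVE_MSAN 1
#  endif
#endif

/*
 * Hypothetical helper: copy a raw stack snapshot into dst and mark the
 * copy as initialized, so that an unwinder probing it does not trip a
 * false use-of-uninitialized-value report.
 */
void *copy_stack_snapshot(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
#ifdef HAVE_MSAN
	/* The copy carries the source's poison; clear it on the copy only. */
	__msan_unpoison(dst, len);
#endif
	return dst;
}
```

Only the destination buffer is unpoisoned, so genuinely uninitialized reads from the original stack would still be reported.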
Calling btrfs_qgroup_reserve_meta_prealloc from
btrfs_delayed_inode_reserve_metadata can result in flushing delalloc
while holding a transaction and delayed node locks. This is deadlock
prone. In the past multiple commits:

 * ae5e070 ("btrfs: qgroup: don't try to wait flushing if we're
   already holding a transaction")

 * 6f23277 ("btrfs: qgroup: don't commit transaction when we already
   hold the handle")

Tried to solve various aspects of this but this was always a
whack-a-mole game. Unfortunately those 2 fixes don't solve a deadlock
scenario involving btrfs_delayed_node::mutex. Namely, one thread
can call btrfs_dirty_inode as a result of reading a file and modifying
its atime:

  PID: 6963 TASK: ffff8c7f3f94c000 CPU: 2 COMMAND: "test"
  #0 __schedule at ffffffffa529e07d
  #1 schedule at ffffffffa529e4ff
  #2 schedule_timeout at ffffffffa52a1bdd
  #3 wait_for_completion at ffffffffa529eeea <-- sleeps with delayed node mutex held
  #4 start_delalloc_inodes at ffffffffc0380db5
  #5 btrfs_start_delalloc_snapshot at ffffffffc0393836
  #6 try_flush_qgroup at ffffffffc03f04b2
  #7 __btrfs_qgroup_reserve_meta at ffffffffc03f5bb6 <-- tries to reserve space and starts delalloc inodes.
  #8 btrfs_delayed_update_inode at ffffffffc03e31aa <-- acquires delayed node mutex
  #9 btrfs_update_inode at ffffffffc0385ba8
  #10 btrfs_dirty_inode at ffffffffc038627b <-- TRANSACTION OPENED
  #11 touch_atime at ffffffffa4cf0000
  #12 generic_file_read_iter at ffffffffa4c1f123
  #13 new_sync_read at ffffffffa4ccdc8a
  #14 vfs_read at ffffffffa4cd0849
  #15 ksys_read at ffffffffa4cd0bd1
  #16 do_syscall_64 at ffffffffa4a052eb
  #17 entry_SYSCALL_64_after_hwframe at ffffffffa540008c

This will cause an asynchronous work to flush the delalloc inodes to
happen which can try to acquire the same delayed_node mutex:

  PID: 455 TASK: ffff8c8085fa4000 CPU: 5 COMMAND: "kworker/u16:30"
  #0 __schedule at ffffffffa529e07d
  #1 schedule at ffffffffa529e4ff
  #2 schedule_preempt_disabled at ffffffffa529e80a
  #3 __mutex_lock at ffffffffa529fdcb <-- goes to sleep, never wakes up.
  #4 btrfs_delayed_update_inode at ffffffffc03e3143 <-- tries to acquire the mutex
  #5 btrfs_update_inode at ffffffffc0385ba8 <-- this is the same inode that pid 6963 is holding
  #6 cow_file_range_inline.constprop.78 at ffffffffc0386be7
  #7 cow_file_range at ffffffffc03879c1
  #8 btrfs_run_delalloc_range at ffffffffc038894c
  #9 writepage_delalloc at ffffffffc03a3c8f
  #10 __extent_writepage at ffffffffc03a4c01
  #11 extent_write_cache_pages at ffffffffc03a500b
  #12 extent_writepages at ffffffffc03a6de2
  #13 do_writepages at ffffffffa4c277eb
  #14 __filemap_fdatawrite_range at ffffffffa4c1e5bb
  #15 btrfs_run_delalloc_work at ffffffffc0380987 <-- starts running delayed nodes
  #16 normal_work_helper at ffffffffc03b706c
  #17 process_one_work at ffffffffa4aba4e4
  #18 worker_thread at ffffffffa4aba6fd
  #19 kthread at ffffffffa4ac0a3d
  #20 ret_from_fork at ffffffffa54001ff

To fully address those cases the complete fix is to never issue any
flushing while holding the transaction or the delayed node lock. This
patch achieves it by calling qgroup_reserve_meta directly which will
either succeed without flushing or will fail and return -EDQUOT. In the
latter case that return value is going to be propagated to
btrfs_dirty_inode which will fall back to starting a new transaction.
That's fine as the majority of the time we expect the inode will have
the BTRFS_DELAYED_NODE_INODE_DIRTY flag set which will result in
directly copying the in-memory state.

Fixes: c53e965 ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
CC: [email protected] # 5.10+
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: David Sterba <[email protected]>
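The "reserve without flushing, fail with -EDQUOT, let the caller fall back" pattern can be shown with a toy model. Everything here is hypothetical and is not btrfs code: `struct quota`, `reserve_noflush()`, `flush_delalloc()` and `dirty_inode()` are illustrative names, and the slow path simply flushes and retries as a stand-in for starting a new transaction.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical toy quota; invariant: reclaimable <= reserved <= limit. */
struct quota {
	uint64_t limit;       /* maximum bytes that may be reserved */
	uint64_t reserved;    /* bytes currently reserved */
	uint64_t reclaimable; /* bytes a flush would give back */
	unsigned int flushes; /* number of flushes performed */
};

/* Reserve without ever flushing: succeed, or fail fast with -EDQUOT. */
int reserve_noflush(struct quota *q, uint64_t bytes)
{
	if (bytes > q->limit - q->reserved)
		return -EDQUOT;
	q->reserved += bytes;
	return 0;
}

/* Reclaim space; in this model it is only legal with no locks held. */
void flush_delalloc(struct quota *q)
{
	q->reserved -= q->reclaimable;
	q->reclaimable = 0;
	q->flushes++;
}

int dirty_inode(struct quota *q, uint64_t bytes)
{
	/* Fast path: locks are held here, so flushing is not allowed. */
	int ret = reserve_noflush(q, bytes);

	if (ret != -EDQUOT)
		return ret;

	/* Slow path: locks have been dropped, so flushing is safe. */
	flush_delalloc(q);
	return reserve_noflush(q, bytes);
}
```

The point of the split is that the function called with locks held can never block on a flush; only the outermost caller, which owns no locks at that moment, decides whether to retry.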
The evlist and the cpu/thread maps should be released together. Otherwise following error was reported by Asan. Note that this test still has memory leaks in DSOs so it still fails even after this change. I'll take a look at that too. # perf test -v 26 26: Object code reading : --- start --- test child forked, pid 154184 Looking at the vmlinux_path (8 entries long) symsrc__init: build id mismatch for vmlinux. symsrc__init: cannot get elf header. Using /proc/kcore for kernel data Using /proc/kallsyms for symbols Parsing event 'cycles' mmap size 528384B ... ================================================================= ==154184==ERROR: LeakSanitizer: detected memory leaks Direct leak of 439 byte(s) in 1 object(s) allocated from: #0 0x7fcb66e77037 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:154 #1 0x55ad9b7e821e in dso__new_id util/dso.c:1256 #2 0x55ad9b8cfd4a in __machine__addnew_vdso util/vdso.c:132 #3 0x55ad9b8cfd4a in machine__findnew_vdso util/vdso.c:347 #4 0x55ad9b845b7e in map__new util/map.c:176 #5 0x55ad9b8415a2 in machine__process_mmap2_event util/machine.c:1787 #6 0x55ad9b8fab16 in perf_tool__process_synth_event util/synthetic-events.c:64 #7 0x55ad9b8fab16 in perf_event__synthesize_mmap_events util/synthetic-events.c:499 #8 0x55ad9b8fbfdf in __event__synthesize_thread util/synthetic-events.c:741 #9 0x55ad9b8ff3e3 in perf_event__synthesize_thread_map util/synthetic-events.c:833 #10 0x55ad9b738585 in do_test_code_reading tests/code-reading.c:608 #11 0x55ad9b73b25d in test__code_reading tests/code-reading.c:722 #12 0x55ad9b6f28fb in run_test tests/builtin-test.c:428 #13 0x55ad9b6f28fb in test_and_print tests/builtin-test.c:458 #14 0x55ad9b6f4a53 in __cmd_test tests/builtin-test.c:679 #15 0x55ad9b6f4a53 in cmd_test tests/builtin-test.c:825 #16 0x55ad9b760cc4 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313 #17 0x55ad9b5eaa88 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365 #18 
0x55ad9b5eaa88 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409 #19 0x55ad9b5eaa88 in main /home/namhyung/project/linux/tools/perf/perf.c:539 #20 0x7fcb669acd09 in __libc_start_main ../csu/libc-start.c:308 ... SUMMARY: AddressSanitizer: 471 byte(s) leaked in 2 allocation(s). test child finished with 1 ---- end ---- Object code reading: FAILED! Signed-off-by: Namhyung Kim <[email protected]> Acked-by: Jiri Olsa <[email protected]> Cc: Adrian Hunter <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Leo Yan <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
Pablo Neira Ayuso says:

====================
netfilter: flowtable enhancements

[ This is v2 that includes documentation enhancements, including
  existing limitations. This is a rebase on top of net-next. ]

The following patchset augments the Netfilter flowtable fastpath to
support network topologies that combine IP forwarding, bridge,
classic VLAN devices, bridge VLAN filtering, DSA and PPPoE. This
includes support for the flowtable software and hardware datapaths.

The following picture provides an example scenario:

fast path! .------------------------. / \ | IP forwarding | | / \ \/ | br0 wan ..... eth0 . / \ host C -> veth1 veth2 . switch/router . . eth0 host A

The bridge master device 'br0' has an IP address and a DHCP server is
also assumed to be running to provide connectivity to host A which
reaches the Internet through 'br0' as default gateway. Then, packets
enter the IP forwarding path and Netfilter is used to NAT the packets
before they leave through the wan device.

The general idea is to accelerate forwarding by building a fast path
that takes packets from the ingress path of the bridge port and places
them in the egress path of the wan device (and vice versa). Hence,
skipping the classic bridge and IP stack paths.

** Patches #1 to #6 add the infrastructure which describes the list of
   netdevice hops to reach a given destination MAC address in the local
   network topology.

Patch #1 adds dev_fill_forward_path() and .ndo_fill_forward_path() to
         netdev_ops.

Patch #2 adds .ndo_fill_forward_path for vlan devices, which provides
         the next device hop via vlan->real_dev, the vlan ID and the
         protocol.

Patch #3 adds .ndo_fill_forward_path for bridge devices, which allows
         lookups in the FDB to locate the next device hop (bridge port)
         in the forwarding path.

Patch #4 extends bridge .ndo_fill_forward_path to support bridge VLAN
         filtering.

Patch #5 adds .ndo_fill_forward_path for PPPoE devices.

Patch #6 adds .ndo_fill_forward_path for DSA.

Patches from #7 to #14 update the flowtable software datapath:

Patch #7 adds the transmit path type field to the flow tuple. Two transmit
         paths are supported so far: the neighbour and the xfrm transmit
         paths.

Patch #8 and #9 update the flowtable datapath to use dev_fill_forward_path()
         to obtain the real ingress/egress device for the flowtable datapath.
         This adds the new ethernet xmit direct path to the flowtable.

Patch #10 adds native flowtable VLAN support (up to 2 VLAN tags) through
          dev_fill_forward_path(). The flowtable stores the VLAN id and
          protocol in the flow tuple.

Patch #11 adds native flowtable bridge VLAN filter support through
          dev_fill_forward_path().

Patch #12 adds native flowtable bridge PPPoE through dev_fill_forward_path().

Patch #13 adds DSA support through dev_fill_forward_path().

Patch #14 extends flowtable selftests to cover the flowtable software
          datapath enhancements.

** Patches from #15 to #20 update the flowtable hardware offload datapath:

Patch #15 extends the flowtable hardware offload to support the
          direct ethernet xmit path. This also includes VLAN support.

Patch #16 stores the egress real device in the flow tuple. The software
          flowtable datapath uses dev_hard_header() to transmit packets,
          hence it might refer to a VLAN/DSA/PPPoE software device, not
          the real ethernet device.

Patch #17 deals with switchdev PVID hardware offload to skip it on egress.

Patch #18 adds FLOW_ACTION_PPPOE_PUSH to the flow_offload action API.

Patch #19 extends the flowtable hardware offload to support PPPoE.

Patch #20 adds TC_SETUP_FT support for DSA.

** Patches from #20 to #23: Felix Fietkau adds a new driver which supports
   hardware offload for the mtk PPE engine through the existing flow
   offload API, which supports the flowtable enhancements coming in
   this batch.

Patch #24 extends the documentation and describes existing limitations.

Please, apply, thanks.
====================

Signed-off-by: David S. Miller <[email protected]>
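The notion of resolving "the list of netdevice hops" down to the real device can be illustrated with a toy walk. This is a simplified sketch and not the kernel API: `struct toy_dev`, `struct toy_path`, `TOY_PATH_MAX` and `toy_fill_forward_path()` are hypothetical, and each device here has a single fixed lower device, whereas a real bridge would pick its port by an FDB lookup on the destination MAC address.

```c
#include <stddef.h>

#define TOY_PATH_MAX 5  /* hypothetical cap on the number of recorded hops */

struct toy_dev {
	const char *name;
	struct toy_dev *lower;  /* next hop toward the wire, NULL for a real device */
};

struct toy_path {
	const struct toy_dev *hops[TOY_PATH_MAX];
	int num;
};

/*
 * Walk from a stacked device (VLAN, PPPoE, ...) down to the real device,
 * recording every software hop on the way. Returns the real device, or
 * NULL when the stack is deeper than TOY_PATH_MAX.
 */
const struct toy_dev *toy_fill_forward_path(const struct toy_dev *dev,
					    struct toy_path *path)
{
	path->num = 0;
	while (dev->lower) {
		if (path->num == TOY_PATH_MAX)
			return NULL;
		path->hops[path->num++] = dev;
		dev = dev->lower;
	}
	return dev;
}
```

For a PPPoE device stacked on a VLAN device stacked on `eth0`, the walk records the two software hops and returns `eth0`, which is the device a fast path would transmit on directly.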
I got several memory leak reports from Asan with a simple command. It was because VDSO is not released due to the refcount. Like in __dsos_addnew_id(), it should put the refcount after adding to the list. $ perf record true [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.030 MB perf.data (10 samples) ] ================================================================= ==692599==ERROR: LeakSanitizer: detected memory leaks Direct leak of 439 byte(s) in 1 object(s) allocated from: #0 0x7fea52341037 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:154 #1 0x559bce4aa8ee in dso__new_id util/dso.c:1256 #2 0x559bce59245a in __machine__addnew_vdso util/vdso.c:132 #3 0x559bce59245a in machine__findnew_vdso util/vdso.c:347 #4 0x559bce50826c in map__new util/map.c:175 #5 0x559bce503c92 in machine__process_mmap2_event util/machine.c:1787 #6 0x559bce512f6b in machines__deliver_event util/session.c:1481 #7 0x559bce515107 in perf_session__deliver_event util/session.c:1551 #8 0x559bce51d4d2 in do_flush util/ordered-events.c:244 #9 0x559bce51d4d2 in __ordered_events__flush util/ordered-events.c:323 #10 0x559bce519bea in __perf_session__process_events util/session.c:2268 #11 0x559bce519bea in perf_session__process_events util/session.c:2297 #12 0x559bce2e7a52 in process_buildids /home/namhyung/project/linux/tools/perf/builtin-record.c:1017 #13 0x559bce2e7a52 in record__finish_output /home/namhyung/project/linux/tools/perf/builtin-record.c:1234 #14 0x559bce2ed4f6 in __cmd_record /home/namhyung/project/linux/tools/perf/builtin-record.c:2026 #15 0x559bce2ed4f6 in cmd_record /home/namhyung/project/linux/tools/perf/builtin-record.c:2858 #16 0x559bce422db4 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313 #17 0x559bce2acac8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365 #18 0x559bce2acac8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409 #19 0x559bce2acac8 in main 
/home/namhyung/project/linux/tools/perf/perf.c:539 #20 0x7fea51e76d09 in __libc_start_main ../csu/libc-start.c:308 Indirect leak of 32 byte(s) in 1 object(s) allocated from: #0 0x7fea52341037 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:154 #1 0x559bce520907 in nsinfo__copy util/namespaces.c:169 #2 0x559bce50821b in map__new util/map.c:168 #3 0x559bce503c92 in machine__process_mmap2_event util/machine.c:1787 #4 0x559bce512f6b in machines__deliver_event util/session.c:1481 #5 0x559bce515107 in perf_session__deliver_event util/session.c:1551 #6 0x559bce51d4d2 in do_flush util/ordered-events.c:244 #7 0x559bce51d4d2 in __ordered_events__flush util/ordered-events.c:323 #8 0x559bce519bea in __perf_session__process_events util/session.c:2268 #9 0x559bce519bea in perf_session__process_events util/session.c:2297 #10 0x559bce2e7a52 in process_buildids /home/namhyung/project/linux/tools/perf/builtin-record.c:1017 #11 0x559bce2e7a52 in record__finish_output /home/namhyung/project/linux/tools/perf/builtin-record.c:1234 #12 0x559bce2ed4f6 in __cmd_record /home/namhyung/project/linux/tools/perf/builtin-record.c:2026 #13 0x559bce2ed4f6 in cmd_record /home/namhyung/project/linux/tools/perf/builtin-record.c:2858 #14 0x559bce422db4 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313 #15 0x559bce2acac8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365 #16 0x559bce2acac8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409 #17 0x559bce2acac8 in main /home/namhyung/project/linux/tools/perf/perf.c:539 #18 0x7fea51e76d09 in __libc_start_main ../csu/libc-start.c:308 SUMMARY: AddressSanitizer: 471 byte(s) leaked in 2 allocation(s). 
Signed-off-by: Namhyung Kim <[email protected]> Acked-by: Jiri Olsa <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lore.kernel.org/lkml/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
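The refcount rule mentioned above (drop the creator's reference once the list has taken its own) can be sketched with a toy object. The names `toy_dso`, `toy_list`, `toy_list_addnew()` and friends are hypothetical and only loosely modelled on the description; this is not the perf implementation.

```c
#include <stdlib.h>

/* Hypothetical refcounted object kept on a singly linked list. */
struct toy_dso {
	int refcnt;
	struct toy_dso *next;
};

struct toy_list {
	struct toy_dso *head;
};

struct toy_dso *toy_dso_new(void)
{
	struct toy_dso *d = calloc(1, sizeof(*d));

	if (d)
		d->refcnt = 1;  /* reference owned by the creator */
	return d;
}

struct toy_dso *toy_dso_get(struct toy_dso *d)
{
	d->refcnt++;
	return d;
}

void toy_dso_put(struct toy_dso *d)
{
	if (d && --d->refcnt == 0)
		free(d);
}

/* The list takes its own reference when an object is added. */
void toy_list_add(struct toy_list *l, struct toy_dso *d)
{
	d->next = l->head;
	l->head = toy_dso_get(d);
}

/* Create and add; the creator's reference is dropped after the add. */
struct toy_dso *toy_list_addnew(struct toy_list *l)
{
	struct toy_dso *d = toy_dso_new();

	if (!d)
		return NULL;
	toy_list_add(l, d);
	toy_dso_put(d);  /* without this put, the object would never be freed */
	return d;
}

/* Drop the list's references; objects with no other users are freed. */
void toy_list_exit(struct toy_list *l)
{
	while (l->head) {
		struct toy_dso *d = l->head;

		l->head = d->next;
		toy_dso_put(d);
	}
}
```

After `toy_list_addnew()` the list holds the only reference, so tearing down the list frees the object; skipping the put leaves the count at 2 and the object leaks, which is the kind of leak the Asan report shows.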
Rfkill blocking and unblocking Intel USB Bluetooth [8087:0026] may make it stop working: [ 509.691509] Bluetooth: hci0: HCI reset during shutdown failed [ 514.897584] Bluetooth: hci0: MSFT filter_enable is already on [ 530.044751] usb 3-10: reset full-speed USB device number 5 using xhci_hcd [ 545.660350] usb 3-10: device descriptor read/64, error -110 [ 561.283530] usb 3-10: device descriptor read/64, error -110 [ 561.519682] usb 3-10: reset full-speed USB device number 5 using xhci_hcd [ 566.686650] Bluetooth: hci0: unexpected event for opcode 0x0500 [ 568.752452] Bluetooth: hci0: urb 0000000096cd309b failed to resubmit (113) [ 578.797955] Bluetooth: hci0: Failed to read MSFT supported features (-110) [ 586.286565] Bluetooth: hci0: urb 00000000c522f633 failed to resubmit (113) [ 596.215302] Bluetooth: hci0: Failed to read MSFT supported features (-110) Or the kernel panics because other workqueues already freed the skb: [ 2048.663763] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 2048.663775] #PF: supervisor read access in kernel mode [ 2048.663779] #PF: error_code(0x0000) - not-present page [ 2048.663782] PGD 0 P4D 0 [ 2048.663787] Oops: 0000 [#1] SMP NOPTI [ 2048.663793] CPU: 3 PID: 4491 Comm: rfkill Tainted: G W 5.13.0-rc1-next-20210510+ #20 [ 2048.663799] Hardware name: HP HP EliteBook 850 G8 Notebook PC/8846, BIOS T76 Ver. 
01.01.04 12/02/2020 [ 2048.663801] RIP: 0010:__skb_ext_put+0x6/0x50 [ 2048.663814] Code: 8b 1b 48 85 db 75 db 5b 41 5c 5d c3 be 01 00 00 00 e8 de 13 c0 ff eb e7 be 02 00 00 00 e8 d2 13 c0 ff eb db 0f 1f 44 00 00 55 <8b> 07 48 89 e5 83 f8 01 74 14 b8 ff ff ff ff f0 0f c1 07 83 f8 01 [ 2048.663819] RSP: 0018:ffffc1d105b6fd80 EFLAGS: 00010286 [ 2048.663824] RAX: 0000000000000000 RBX: ffff9d9ac5649000 RCX: 0000000000000000 [ 2048.663827] RDX: ffffffffc0d1daf6 RSI: 0000000000000206 RDI: 0000000000000000 [ 2048.663830] RBP: ffffc1d105b6fd98 R08: 0000000000000001 R09: ffff9d9ace8ceac0 [ 2048.663834] R10: ffff9d9ace8ceac0 R11: 0000000000000001 R12: ffff9d9ac5649000 [ 2048.663838] R13: 0000000000000000 R14: 00007ffe0354d650 R15: 0000000000000000 [ 2048.663843] FS: 00007fe02ab19740(0000) GS:ffff9d9e5f8c0000(0000) knlGS:0000000000000000 [ 2048.663849] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2048.663853] CR2: 0000000000000000 CR3: 0000000111a52004 CR4: 0000000000770ee0 [ 2048.663856] PKRU: 55555554 [ 2048.663859] Call Trace: [ 2048.663865] ? skb_release_head_state+0x5e/0x80 [ 2048.663873] kfree_skb+0x2f/0xb0 [ 2048.663881] btusb_shutdown_intel_new+0x36/0x60 [btusb] [ 2048.663905] hci_dev_do_close+0x48c/0x5e0 [bluetooth] [ 2048.663954] ? __cond_resched+0x1a/0x50 [ 2048.663962] hci_rfkill_set_block+0x56/0xa0 [bluetooth] [ 2048.664007] rfkill_set_block+0x98/0x170 [ 2048.664016] rfkill_fop_write+0x136/0x1e0 [ 2048.664022] vfs_write+0xc7/0x260 [ 2048.664030] ksys_write+0xb1/0xe0 [ 2048.664035] ? 
exit_to_user_mode_prepare+0x37/0x1c0 [ 2048.664042] __x64_sys_write+0x1a/0x20 [ 2048.664048] do_syscall_64+0x40/0xb0 [ 2048.664055] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 2048.664060] RIP: 0033:0x7fe02ac23c27 [ 2048.664066] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 [ 2048.664070] RSP: 002b:00007ffe0354d638 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 2048.664075] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fe02ac23c27 [ 2048.664078] RDX: 0000000000000008 RSI: 00007ffe0354d650 RDI: 0000000000000003 [ 2048.664081] RBP: 0000000000000000 R08: 0000559b05998440 R09: 0000559b05998440 [ 2048.664084] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003 [ 2048.664086] R13: 0000000000000000 R14: ffffffff00000000 R15: 00000000ffffffff So move the shutdown callback to a place where workqueues are either flushed or cancelled to resolve the issue. Signed-off-by: Kai-Heng Feng <[email protected]> Signed-off-by: Marcel Holtmann <[email protected]>
When the kernel is built with CONFIG_KASAN_HW_TAGS and the CPU supports MTE, memory accesses are checked at 16-byte granularity, and out-of-bounds accesses can result in tag check faults. Our current implementation of strlen() makes unaligned 16-byte accesses (within a naturally aligned 4096-byte window), and can trigger tag check faults.

This can be seen at boot time, e.g.

| BUG: KASAN: invalid-access in __pi_strlen+0x14/0x150
| Read at addr f4ff0000c0028300 by task swapper/0/0
| Pointer tag: [f4], memory tag: [fe]
|
| CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.13.0-09550-g03c2813535a2-dirty #20
| Hardware name: linux,dummy-virt (DT)
| Call trace:
|  dump_backtrace+0x0/0x1b0
|  show_stack+0x1c/0x30
|  dump_stack_lvl+0x68/0x84
|  print_address_description+0x7c/0x2b4
|  kasan_report+0x138/0x38c
|  __do_kernel_fault+0x190/0x1c4
|  do_tag_check_fault+0x78/0x90
|  do_mem_abort+0x44/0xb4
|  el1_abort+0x40/0x60
|  el1h_64_sync_handler+0xb0/0xd0
|  el1h_64_sync+0x78/0x7c
|  __pi_strlen+0x14/0x150
|  __register_sysctl_table+0x7c4/0x890
|  register_leaf_sysctl_tables+0x1a4/0x210
|  register_leaf_sysctl_tables+0xc8/0x210
|  __register_sysctl_paths+0x22c/0x290
|  register_sysctl_table+0x2c/0x40
|  sysctl_init+0x20/0x30
|  proc_sys_init+0x3c/0x48
|  proc_root_init+0x80/0x9c
|  start_kernel+0x640/0x69c
|  __primary_switched+0xc0/0xc8

To fix this, we can reduce the (strlen-internal) MIN_PAGE_SIZE to 16 bytes when CONFIG_KASAN_HW_TAGS is selected. This will cause strlen() to align the base pointer downwards to a 16-byte boundary, and to discard the additional prefix bytes without counting them. All subsequent accesses will be 16-byte aligned 16-byte LDPs. While the comments say the body of the loop will access 32 bytes, this is performed as two 16-byte accesses, with the second made only if the first did not encounter a NUL byte, so the body of the loop will not over-read across a 16-byte boundary.

No other string routines are affected.
The other str*() routines will not make any access which straddles a 16-byte boundary, and the mem*() routines will only make accesses which straddle a 16-byte boundary when the access is entirely within the bounds of the relevant base and size arguments.

Fixes: 325a1de ("arm64: Import updated version of Cortex Strings' strlen")
Signed-off-by: Mark Rutland <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: Will Deacon <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Reviewed-by: Robin Murphy <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
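The alignment step described in the strlen() fix can be sketched in portable C. This is only an illustration, not the arm64 assembly: the names `granule_base`, `straddles_granule` and `strlen_by_granule` are hypothetical, and a byte loop stands in for the 16-byte LDP loads. It shows the base pointer being rounded down to a 16-byte boundary and the prefix bytes being skipped without being counted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GRANULE 16u /* MTE tag granule size in bytes */

/* Round an address down to the start of its 16-byte granule. */
static uintptr_t granule_base(uintptr_t addr)
{
	return addr & ~(uintptr_t)(GRANULE - 1);
}

/* Does an access of `size` (>= 1) bytes at `addr` touch two granules? */
static int straddles_granule(uintptr_t addr, size_t size)
{
	return granule_base(addr) != granule_base(addr + size - 1);
}

/*
 * Hypothetical stand-in for the fixed strlen(): walk the string one
 * aligned granule at a time. Bytes before `s` inside the first granule
 * are skipped and never counted.
 */
static size_t strlen_by_granule(const char *s)
{
	uintptr_t addr = (uintptr_t)s;
	const unsigned char *p = (const unsigned char *)granule_base(addr);
	size_t skip = addr - granule_base(addr); /* prefix bytes to discard */
	size_t len = 0;

	/* An aligned 16-byte window never crosses a granule boundary. */
	assert(!straddles_granule((uintptr_t)p, GRANULE));

	for (;;) {
		for (size_t i = skip; i < GRANULE; i++) {
			if (p[i] == '\0')
				return len;
			len++;
		}
		skip = 0;     /* only the first granule has a prefix */
		p += GRANULE; /* next aligned granule */
	}
}
```

Calling `strlen_by_granule(buf + 5)` on a NUL-terminated buffer should return the same value as `strlen(buf + 5)`, while `straddles_granule(0x1008, 16)` reports the kind of unaligned 16-byte access that the original code could make.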
This patch adds a "-j" mode to test_progs, which executes tests in multiple processes. The "-j" mode is optional and works with all existing test selection mechanisms, as well as with "-v", "-l", etc.

In "-j" mode, the main process uses UDS/DGRAM to communicate with each forked worker, commanding it to run tests and collect logs. After all tests have finished, a summary is printed. The main process uses multiple competing threads to dispatch work to the workers, trying to keep them all busy.

Example output:

  > ./test_progs -n 15-20 -j
  [ 8.584709] bpf_testmod: loading out-of-tree module taints kernel.
  Launching 2 workers.
  [0]: Running test 15.
  [1]: Running test 16.
  [1]: Running test 17.
  [1]: Running test 18.
  [1]: Running test 19.
  [1]: Running test 20.
  [1]: worker exit.
  [0]: worker exit.
  #15 btf_dump:OK
  #16 btf_endian:OK
  #17 btf_map_in_map:OK
  #18 btf_module:OK
  #19 btf_skc_cls_ingress:OK
  #20 btf_split:OK
  Summary: 6/20 PASSED, 0 SKIPPED, 0 FAILED

Known issue: some tests fail when run concurrently; a later patch will either fix those tests or pin them to worker 0.

Signed-off-by: Yucong Sun <[email protected]>

V3 -> V2: fix missing outputs in commit messages.
V2 -> V1: switch to UDS client/server model.
This patch adds a "-j" mode to test_progs, which executes tests in multiple processes. The "-j" mode is optional and works with all existing test selection mechanisms, as well as with "-v", "-l", etc.

In "-j" mode, the main process uses UDS/DGRAM to communicate with each forked worker, commanding it to run tests and collect logs. After all tests have finished, a summary is printed. The main process uses multiple competing threads to dispatch work to the workers, trying to keep them all busy.

Example output:

  > ./test_progs -n 15-20 -j
  [ 8.584709] bpf_testmod: loading out-of-tree module taints kernel.
  Launching 2 workers.
  [0]: Running test 15.
  [1]: Running test 16.
  [1]: Running test 17.
  [1]: Running test 18.
  [1]: Running test 19.
  [1]: Running test 20.
  [1]: worker exit.
  [0]: worker exit.
  #15 btf_dump:OK
  #16 btf_endian:OK
  #17 btf_map_in_map:OK
  #18 btf_module:OK
  #19 btf_skc_cls_ingress:OK
  #20 btf_split:OK
  Summary: 6/20 PASSED, 0 SKIPPED, 0 FAILED

Known issue: some tests fail when run concurrently; a later patch will either fix those tests or pin them to worker 0.

Signed-off-by: Yucong Sun <[email protected]>

V4 -> V3: address style warnings.
V3 -> V2: fix missing outputs in commit messages.
V2 -> V1: switch to UDS client/server model.
This patch adds a "-j" mode to test_progs, which executes tests in multiple processes. The "-j" mode is optional and works with all existing test selection mechanisms, as well as with "-v", "-l", etc.

In "-j" mode, the main process uses UDS to communicate with each forked worker, commanding it to run tests and collect logs. After all tests have finished, a summary is printed. The main process uses multiple competing threads to dispatch work to the workers, trying to keep them all busy.

Each test's status is printed as soon as the test finishes; any error logs are printed after the final summary line.

Specifying "--debug" prints additional debug information about server/worker communication.

Example output:

  > ./test_progs -n 15-20 -j
  [ 12.801730] bpf_testmod: loading out-of-tree module taints kernel.
  Launching 8 workers.
  #20 btf_split:OK
  #16 btf_endian:OK
  #18 btf_module:OK
  #17 btf_map_in_map:OK
  #19 btf_skc_cls_ingress:OK
  #15 btf_dump:OK
  Summary: 6/20 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Yucong Sun <[email protected]>
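The dispatcher/worker arrangement described above can be sketched with fork() and a datagram socketpair(). This is a simplified illustration, not the test_progs implementation: `worker_loop`, `run_tests_parallel` and the two-integer reply format are hypothetical, and the main process hands out work in batches from a single thread instead of using competing dispatch threads.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_WORKERS 8

/* Worker side: receive test numbers, report each one, stop on -1. */
static void worker_loop(int sock)
{
	int test;

	while (recv(sock, &test, sizeof(test), 0) == sizeof(test) && test >= 0) {
		/* A real worker would run the test here and collect its log. */
		int reply[2] = { test, 0 /* 0 = passed in this sketch */ };

		send(sock, reply, sizeof(reply), 0);
	}
	_exit(0);
}

/* Main side: fork workers, dispatch tests first..last, count passes. */
static int run_tests_parallel(int nworkers, int first, int last)
{
	int socks[MAX_WORKERS];
	int passed = 0;
	int stop = -1;

	if (nworkers < 1 || nworkers > MAX_WORKERS)
		return -1;

	for (int w = 0; w < nworkers; w++) {
		int sv[2];
		pid_t pid;

		if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
			return -1;
		pid = fork();
		if (pid < 0)
			return -1;
		if (pid == 0) {
			close(sv[0]);
			worker_loop(sv[1]); /* never returns */
		}
		close(sv[1]);
		socks[w] = sv[0];
	}

	/* Give every worker one test, then gather that batch of replies. */
	for (int t = first; t <= last; ) {
		int sent = 0;

		for (int w = 0; w < nworkers && t <= last; w++, t++, sent++)
			send(socks[w], &t, sizeof(t), 0);
		for (int w = 0; w < sent; w++) {
			int reply[2];

			if (recv(socks[w], reply, sizeof(reply), 0) == sizeof(reply) &&
			    reply[1] == 0)
				passed++;
		}
	}

	/* Tell every worker to exit, then reap the children. */
	for (int w = 0; w < nworkers; w++) {
		send(socks[w], &stop, sizeof(stop), 0);
		close(socks[w]);
	}
	while (wait(NULL) > 0)
		;
	return passed;
}
```

With two workers, `run_tests_parallel(2, 15, 20)` dispatches six test numbers and returns the number reported as passed. The batch loop keeps the sketch short; a dispatcher that hands a new test to whichever worker replies first would keep the workers busier.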
When creating ceq_0 during probing irdma, cqp.sc_cqp will be sent as a cqp_request to cqp->sc_cqp.sq_ring. If the request is pending when removing the irdma driver or unplugging its aux device, cqp.sc_cqp will be dereferenced as the wrong struct in irdma_free_pending_cqp_request().

PID: 3669 TASK: ffff88aef892c000 CPU: 28 COMMAND: "kworker/28:0"
 #0 [fffffe0000549e38] crash_nmi_callback at ffffffff810e3a34
 #1 [fffffe0000549e40] nmi_handle at ffffffff810788b2
 #2 [fffffe0000549ea0] default_do_nmi at ffffffff8107938f
 #3 [fffffe0000549eb8] do_nmi at ffffffff81079582
 #4 [fffffe0000549ef0] end_repeat_nmi at ffffffff82e016b4
    [exception RIP: native_queued_spin_lock_slowpath+1291]
    RIP: ffffffff8127e72b RSP: ffff88aa841ef778 RFLAGS: 00000046
    RAX: 0000000000000000 RBX: ffff88b01f849700 RCX: ffffffff8127e47e
    RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffffffff83857ec0
    RBP: ffff88afe3e4efc8 R8: ffffed15fc7c9dfa R9: ffffed15fc7c9dfa
    R10: 0000000000000001 R11: ffffed15fc7c9df9 R12: 0000000000740000
    R13: ffff88b01f849708 R14: 0000000000000003 R15: ffffed1603f092e1
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
-- <NMI exception stack> --
 #5 [ffff88aa841ef778] native_queued_spin_lock_slowpath at ffffffff8127e72b
 #6 [ffff88aa841ef7b0] _raw_spin_lock_irqsave at ffffffff82c22aa4
 #7 [ffff88aa841ef7c8] __wake_up_common_lock at ffffffff81257363
 #8 [ffff88aa841ef888] irdma_free_pending_cqp_request at ffffffffa0ba12cc [irdma]
 #9 [ffff88aa841ef958] irdma_cleanup_pending_cqp_op at ffffffffa0ba1469 [irdma]
#10 [ffff88aa841ef9c0] irdma_ctrl_deinit_hw at ffffffffa0b2989f [irdma]
#11 [ffff88aa841efa28] irdma_remove at ffffffffa0b252df [irdma]
#12 [ffff88aa841efae8] auxiliary_bus_remove at ffffffff8219afdb
#13 [ffff88aa841efb00] device_release_driver_internal at ffffffff821882e6
#14 [ffff88aa841efb38] bus_remove_device at ffffffff82184278
#15 [ffff88aa841efb88] device_del at ffffffff82179d23
#16 [ffff88aa841efc48] ice_unplug_aux_dev at ffffffffa0eb1c14 [ice]
#17 [ffff88aa841efc68] ice_service_task at ffffffffa0d88201 [ice]
#18 [ffff88aa841efde8] process_one_work at ffffffff811c589a
#19 [ffff88aa841efe60] worker_thread at ffffffff811c71ff
#20 [ffff88aa841eff10] kthread at ffffffff811d87a0
#21 [ffff88aa841eff50] ret_from_fork at ffffffff82e0022f

Fixes: 44d9e52 ("RDMA/irdma: Implement device initialization definitions")
Link: https://lore.kernel.org/r/[email protected]
Suggested-by: "Ismail, Mustafa" <[email protected]>
Signed-off-by: Shifeng Li <[email protected]>
Reviewed-by: Shiraz Saleem <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Commit a5a9230 (fbdev: fbcon: Properly revert changes when vc_resize() failed) started restoring old font data upon failure (of vc_resize()). But it does so only for user fonts, which means that the "system"/internal fonts are not restored at all. As a result, the very first call to fbcon_do_set_font() performs no restore at all when vc_resize() fails.

This can be reproduced by Syzkaller to crash the system on the next invocation of font_get(). It's rather hard to hit the allocation failure in vc_resize() on the first font_set(), but not impossible, especially if fault injection is used to aid the execution/failure. It was demonstrated by Sirius:

  BUG: unable to handle page fault for address: fffffffffffffff8
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD cb7b067 P4D cb7b067 PUD cb7d067 PMD 0
  Oops: 0000 [#1] PREEMPT SMP KASAN
  CPU: 1 PID: 8007 Comm: poc Not tainted 6.7.0-g9d1694dc91ce #20
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  RIP: 0010:fbcon_get_font+0x229/0x800 drivers/video/fbdev/core/fbcon.c:2286
  Call Trace:
   <TASK>
   con_font_get drivers/tty/vt/vt.c:4558 [inline]
   con_font_op+0x1fc/0xf20 drivers/tty/vt/vt.c:4673
   vt_k_ioctl drivers/tty/vt/vt_ioctl.c:474 [inline]
   vt_ioctl+0x632/0x2ec0 drivers/tty/vt/vt_ioctl.c:752
   tty_ioctl+0x6f8/0x1570 drivers/tty/tty_io.c:2803
   vfs_ioctl fs/ioctl.c:51 [inline]
   ...

So restore the font data in any case, not only for user fonts. Note that the later 'if' is now protected by 'old_userfont' and not 'old_data', as the latter is always set now. (And it is supposed to be non-NULL. Otherwise we would see the bug above again.)

Signed-off-by: Jiri Slaby (SUSE) <[email protected]>
Fixes: a5a9230 ("fbdev: fbcon: Properly revert changes when vc_resize() failed")
Reported-and-tested-by: Ubisectech Sirius <[email protected]>
Cc: Ubisectech Sirius <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Daniel Vetter <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
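The "restore in any case" rule from the fbcon fix can be illustrated with a small userspace sketch. Everything here is hypothetical (`console_font_state`, `fake_resize`, `set_font`, the `resize_fails` knob); it is not the fbcon code. The point it shows is that the old state is saved and restored unconditionally, regardless of whether the previous font was a user font or a built-in one.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-console font state. */
struct console_font_state {
	const unsigned char *data;
	int width, height;
	int userfont; /* 1 if the font was supplied by user space */
};

static int resize_fails; /* knob standing in for an allocation failure */

/* Stand-in for vc_resize(): fails with -ENOMEM when the knob is set. */
static int fake_resize(int width, int height)
{
	(void)width;
	(void)height;
	return resize_fails ? -ENOMEM : 0;
}

static int set_font(struct console_font_state *st, const unsigned char *data,
		    int width, int height, int userfont)
{
	/* Snapshot the old state unconditionally, system font or not. */
	struct console_font_state old;
	int ret;

	assert(st != NULL && data != NULL);
	old = *st;

	st->data = data;
	st->width = width;
	st->height = height;
	st->userfont = userfont;

	ret = fake_resize(width, height);
	if (ret) {
		/* Roll back everything, not only when old.userfont is set. */
		*st = old;
		return ret;
	}
	return 0;
}
```

If the rollback were guarded by `old.userfont`, a failed first call on a console that still uses a built-in font would leave `st->data` pointing at the rejected font, which is the situation the commit message describes.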
Hi Luiz, could you review this patch?

This patch prevents a div-by-zero error and a potential integer overflow by adding a range check for the MTU in hci_cc_le_read_buffer_size() and hci_cc_le_read_buffer_size_v2(). Also, hci_connect_le() will refuse to allocate hcon if the MTU is not in the valid range.

Bug description:

l2cap_le_flowctl_init() can cause both a div-by-zero and an integer overflow.

  l2cap_le_flowctl_init()
    chan->mps = min_t(u16, chan->imtu, chan->conn->mtu - L2CAP_HDR_SIZE);
    chan->rx_credits = (chan->imtu / chan->mps) + 1; <- div-by-zero

Here, chan->conn->mtu could be less than or equal to L2CAP_HDR_SIZE (4). If mtu is 4, it causes a div-by-zero. If mtu is less than 4, it causes an integer overflow.

How mtu could have such a low value:

  hci_cc_le_read_buffer_size()
    hdev->le_mtu = __le16_to_cpu(rp->le_mtu);

  l2cap_conn_add()
    conn->mtu = hcon->hdev->le_mtu;

As shown, mtu is an input from an HCI device, so any HCI device can set mtu to any value, including values lower than 4. According to the spec v5.4, 7.8.2 LE Read Buffer Size command, the value should fall in [0x001b, 0xffff].

Thank you,
Sungwoo.

divide error: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 0 PID: 67 Comm: kworker/u5:0 Tainted: G W 6.9.0-rc5+ #20
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Workqueue: hci0 hci_rx_work
RIP: 0010:l2cap_le_flowctl_init+0x19e/0x3f0 net/bluetooth/l2cap_core.c:547
Code: e8 17 17 0c 00 66 41 89 9f 84 00 00 00 bf 01 00 00 00 41 b8 02 00 00 00 4c 89 fe 4c 89 e2 89 d9 e8 27 17 0c 00 44 89 f0 31 d2 <66> f7 f3 89 c3 ff c3 4d 8d b7 88 00 00 00 4c 89 f0 48 c1 e8 03 42
RSP: 0018:ffff88810bc0f858 EFLAGS: 00010246
RAX: 00000000000002a0 RBX: 0000000000000000 RCX: dffffc0000000000
RDX: 0000000000000000 RSI: ffff88810bc0f7c0 RDI: ffffc90002dcb66f
RBP: ffff88810bc0f880 R08: aa69db2dda70ff01 R09: 0000ffaaaaaaaaaa
R10: 0084000000ffaaaa R11: 0000000000000000 R12: ffff88810d65a084
R13: dffffc0000000000 R14: 00000000000002a0 R15: ffff88810d65a000
FS: 0000000000000000(0000) GS:ffff88811ac00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000100 CR3: 0000000103268003 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
 <TASK>
 l2cap_le_connect_req net/bluetooth/l2cap_core.c:4902 [inline]
 l2cap_le_sig_cmd net/bluetooth/l2cap_core.c:5420 [inline]
 l2cap_le_sig_channel net/bluetooth/l2cap_core.c:5486 [inline]
 l2cap_recv_frame+0xe59d/0x11710 net/bluetooth/l2cap_core.c:6809
 l2cap_recv_acldata+0x544/0x10a0 net/bluetooth/l2cap_core.c:7506
 hci_acldata_packet net/bluetooth/hci_core.c:3939 [inline]
 hci_rx_work+0x5e5/0xb20 net/bluetooth/hci_core.c:4176
 process_one_work kernel/workqueue.c:3254 [inline]
 process_scheduled_works+0x90f/0x1530 kernel/workqueue.c:3335
 worker_thread+0x926/0xe70 kernel/workqueue.c:3416
 kthread+0x2e3/0x380 kernel/kthread.c:388
 ret_from_fork+0x5c/0x90 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---

Signed-off-by: Sungwoo Kim <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
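The arithmetic behind the two failure modes can be checked with a small userspace sketch. The helper names `compute_mps` and `le_mtu_valid` are hypothetical; `L2CAP_HDR_SIZE` (4) and the lower bound 0x001b are taken from the description above, and the 16-bit truncation mirrors what `min_t(u16, ...)` does to the difference.

```c
#include <assert.h>
#include <stdint.h>

#define L2CAP_HDR_SIZE 4
#define LE_MTU_MIN     0x001b /* lower bound quoted from the spec range above */

/*
 * Mirrors min_t(u16, imtu, mtu - L2CAP_HDR_SIZE): the difference is
 * truncated to 16 bits before the comparison.
 */
static uint16_t compute_mps(uint16_t imtu, uint16_t conn_mtu)
{
	uint16_t room = (uint16_t)(conn_mtu - L2CAP_HDR_SIZE);

	return imtu < room ? imtu : room;
}

/* Range check in the spirit of the patch; the upper bound is the u16 max. */
static int le_mtu_valid(uint16_t mtu)
{
	return mtu >= LE_MTU_MIN;
}

/* rx_credits as computed in the excerpt; the caller must ensure mps != 0. */
static uint16_t compute_rx_credits(uint16_t imtu, uint16_t mps)
{
	assert(mps != 0);
	return (uint16_t)(imtu / mps + 1);
}
```

With an imtu of 672 (0x2a0, the value visible in RAX in the trace), `compute_mps(672, 4)` yields 0, which is the divisor that triggers the divide error, while `compute_mps(672, 3)` wraps the difference to 0xffff and so returns 672. Rejecting any MTU for which `le_mtu_valid()` is false rules out both cases before the division is reached.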
l2cap_le_flowctl_init() can cause both a div-by-zero and an integer overflow, since hdev->le_mtu may not fall in the valid range.

Move the MTU from hci_dev to hci_conn so that it can be validated, and stop the connection process earlier if the MTU is invalid. Also, add a missing validation in read_buffer_size() and make it return an error value if the validation fails. hci_conn_add() now returns ERR_PTR(), as it can fail due to both a kzalloc failure and an invalid MTU value.

divide error: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 0 PID: 67 Comm: kworker/u5:0 Tainted: G W 6.9.0-rc5+ #20
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Workqueue: hci0 hci_rx_work
RIP: 0010:l2cap_le_flowctl_init+0x19e/0x3f0 net/bluetooth/l2cap_core.c:547
Code: e8 17 17 0c 00 66 41 89 9f 84 00 00 00 bf 01 00 00 00 41 b8 02 00 00 00 4c 89 fe 4c 89 e2 89 d9 e8 27 17 0c 00 44 89 f0 31 d2 <66> f7 f3 89 c3 ff c3 4d 8d b7 88 00 00 00 4c 89 f0 48 c1 e8 03 42
RSP: 0018:ffff88810bc0f858 EFLAGS: 00010246
RAX: 00000000000002a0 RBX: 0000000000000000 RCX: dffffc0000000000
RDX: 0000000000000000 RSI: ffff88810bc0f7c0 RDI: ffffc90002dcb66f
RBP: ffff88810bc0f880 R08: aa69db2dda70ff01 R09: 0000ffaaaaaaaaaa
R10: 0084000000ffaaaa R11: 0000000000000000 R12: ffff88810d65a084
R13: dffffc0000000000 R14: 00000000000002a0 R15: ffff88810d65a000
FS: 0000000000000000(0000) GS:ffff88811ac00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000100 CR3: 0000000103268003 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
 <TASK>
 l2cap_le_connect_req net/bluetooth/l2cap_core.c:4902 [inline]
 l2cap_le_sig_cmd net/bluetooth/l2cap_core.c:5420 [inline]
 l2cap_le_sig_channel net/bluetooth/l2cap_core.c:5486 [inline]
 l2cap_recv_frame+0xe59d/0x11710 net/bluetooth/l2cap_core.c:6809
 l2cap_recv_acldata+0x544/0x10a0 net/bluetooth/l2cap_core.c:7506
 hci_acldata_packet net/bluetooth/hci_core.c:3939 [inline]
 hci_rx_work+0x5e5/0xb20 net/bluetooth/hci_core.c:4176
 process_one_work kernel/workqueue.c:3254 [inline]
 process_scheduled_works+0x90f/0x1530 kernel/workqueue.c:3335
 worker_thread+0x926/0xe70 kernel/workqueue.c:3416
 kthread+0x2e3/0x380 kernel/kthread.c:388
 ret_from_fork+0x5c/0x90 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---

Fixes: 6ed58ec ("Bluetooth: Use LE buffers for LE traffic")
Suggested-by: Luiz Augusto von Dentz <[email protected]>
Signed-off-by: Sungwoo Kim <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>
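The error-pointer convention mentioned above, where one return value carries either a valid object or an error code, can be sketched in userspace as follows. The lowercase helpers `err_ptr`, `is_err` and `ptr_err` are stand-ins for the kernel's ERR_PTR(), IS_ERR() and PTR_ERR(); `struct conn`, `conn_add` and the specific error codes chosen are assumptions for illustration and are not taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO  4095
#define LE_MTU_MIN 0x001b /* lower bound of the valid range quoted earlier */

/* Userspace stand-ins for the kernel's ERR_PTR()/PTR_ERR()/IS_ERR(). */
static void *err_ptr(long err)
{
	return (void *)err;
}

static long ptr_err(const void *p)
{
	return (long)p;
}

static int is_err(const void *p)
{
	/* Error codes live in the top MAX_ERRNO values of the address space. */
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical connection object that carries its own validated MTU. */
struct conn {
	uint16_t mtu;
};

/* Fails for two distinct reasons, so a plain NULL would lose information. */
static struct conn *conn_add(uint16_t mtu)
{
	struct conn *c;

	if (mtu < LE_MTU_MIN)
		return err_ptr(-EINVAL); /* assumed code for an invalid MTU */

	c = calloc(1, sizeof(*c));
	if (!c)
		return err_ptr(-ENOMEM);

	c->mtu = mtu;
	return c;
}
```

A caller checks `is_err(c)` instead of `!c` and can propagate `ptr_err(c)` directly, which is why every call site has to be updated when a function switches from returning NULL to returning an error pointer.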
… non head_frag The crashed kernel version is 5.16.20, and I have not test this patch because I dont find a way to reproduce it, and the mailine may be has the same problem. When using bpf based NAT, hits a kernel BUG_ON at function skb_segment(), BUG_ON(skb_headlen(list_skb) > len). The bpf calls the bpf_skb_adjust_room to decrease the gso_size, and then call bpf_redirect send packet out. call stack: ... [exception RIP: skb_segment+3016] RIP: ffffffffb97df2a8 RSP: ffffa3f2cce08728 RFLAGS: 00010293 RAX: 000000000000007d RBX: 00000000fffff7b3 RCX: 0000000000000011 RDX: 0000000000000000 RSI: ffff895ea32c76c0 RDI: 00000000000008c1 RBP: ffffa3f2cce087f8 R8: 000000000000088f R9: 0000000000000011 R10: 000000000000090c R11: ffff895e47e68000 R12: ffff895eb2022f00 R13: 000000000000004b R14: ffff895ecdaf2000 R15: ffff895eb2023f00 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 kernel-patches#9 [ffffa3f2cce08720] skb_segment at ffffffffb97ded63 kernel-patches#10 [ffffa3f2cce08800] tcp_gso_segment at ffffffffb98d0320 kernel-patches#11 [ffffa3f2cce08860] tcp4_gso_segment at ffffffffb98d07a3 kernel-patches#12 [ffffa3f2cce08880] inet_gso_segment at ffffffffb98e6de0 kernel-patches#13 [ffffa3f2cce088e0] skb_mac_gso_segment at ffffffffb97f3741 kernel-patches#14 [ffffa3f2cce08918] skb_udp_tunnel_segment at ffffffffb98daa59 kernel-patches#15 [ffffa3f2cce08980] udp4_ufo_fragment at ffffffffb98db471 kernel-patches#16 [ffffa3f2cce089b0] inet_gso_segment at ffffffffb98e6de0 kernel-patches#17 [ffffa3f2cce08a10] skb_mac_gso_segment at ffffffffb97f3741 kernel-patches#18 [ffffa3f2cce08a48] __skb_gso_segment at ffffffffb97f388e kernel-patches#19 [ffffa3f2cce08a78] validate_xmit_skb at ffffffffb97f3d6e kernel-patches#20 [ffffa3f2cce08ab8] __dev_queue_xmit at ffffffffb97f4614 kernel-patches#21 [ffffa3f2cce08b50] dev_queue_xmit at ffffffffb97f5030 kernel-patches#22 [ffffa3f2cce08b60] __bpf_redirect at ffffffffb98199a8 kernel-patches#23 [ffffa3f2cce08b88] skb_do_redirect at ffffffffb98205cd ... 
The skb has the following properties:
    doffset = 66
    list_skb = skb_shinfo(skb)->frag_list
    list_skb->head_frag = true
    skb->len = 2441 && skb->data_len = 2250
    skb_shinfo(skb)->nr_frags = 17
    skb_shinfo(skb)->gso_size = 75
    skb_shinfo(skb)->frags[0...16].bv_len = 125
    list_skb->len = 125
    list_skb->data_len = 0

3962 struct sk_buff *skb_segment(struct sk_buff *head_skb,
3963                             netdev_features_t features)
3964 {
3965         struct sk_buff *segs = NULL;
3966         struct sk_buff *tail = NULL;
...
4181                 while (pos < offset + len) {
4182                         if (i >= nfrags) {
4183                                 i = 0;
4184                                 nfrags = skb_shinfo(list_skb)->nr_frags;
4185                                 frag = skb_shinfo(list_skb)->frags;
4186                                 frag_skb = list_skb;

After head_skb's last frag is segmented, pos == offset + len, so the while loop at line 4181 exits without segmenting the head_frag frag_list skb, and execution then runs into this BUG_ON().

Since commit 13acc94 ("net: permit skb_segment on head_frag frag_list skb"), segmenting a head_frag frag_list skb is allowed. Commit 3dcbdb1 ("net: gso: Fix skb_segment splat when splitting gso_size mangled skb having linear-headed frag_list") clears NETIF_F_SG if the frag_list contains a non-head_frag skb. NETIF_F_SG is not cleared when there is only a single head_frag frag_list skb.

Signed-off-by: Fred Li <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
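The values reported above are enough to redo the arithmetic behind the BUG_ON by hand. The following is a user-space sketch, not kernel code: the struct `skb_report`, the constants, and the helper names are hypothetical, and it assumes that `len` equals gso_size (75) when the BUG_ON is evaluated, which the report does not state explicitly.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space mirror of the two length fields in the report. */
struct skb_report {
	unsigned int len;      /* skb->len */
	unsigned int data_len; /* skb->data_len: bytes outside the linear area */
};

/* Same arithmetic as the kernel's skb_headlen(): linear bytes only. */
static unsigned int headlen(const struct skb_report *skb)
{
	return skb->len - skb->data_len;
}

/* Values copied from the report. */
static const struct skb_report head_skb = { .len = 2441, .data_len = 2250 };
static const struct skb_report list_skb = { .len = 125, .data_len = 0 };
enum { DOFFSET = 66, GSO_SIZE = 75, NR_FRAGS = 17, FRAG_LEN = 125 };

/* Payload bytes in head_skb's linear area, after doffset bytes of headers. */
static unsigned int head_linear_payload(void)
{
	assert(headlen(&head_skb) >= DOFFSET); /* guard against underflow */
	return headlen(&head_skb) - DOFFSET;
}

/* Non-linear bytes: the page frags plus the single frag_list skb. */
static unsigned int nonlinear_bytes(void)
{
	return NR_FRAGS * FRAG_LEN + list_skb.len;
}

/* BUG_ON(skb_headlen(list_skb) > len), ASSUMING len == gso_size here. */
static bool bug_on_would_fire(void)
{
	return headlen(&list_skb) > GSO_SIZE;
}
```

Under these assumptions the head skb carries 191 linear bytes (125 of them payload), the 17 frags plus the frag_list skb account for the full data_len of 2250, and list_skb's 125-byte linear area exceeds the 75-byte segment length, so `bug_on_would_fire()` returns true.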
ui_browser__show() captures the input title, which is stack-allocated memory in hist_browser__run().

Avoid a use after return by strdup-ing the string.

Committer notes:

Further explanation from Ian Rogers:

My command line using tui is:

  $ sudo bash -c 'rm /tmp/asan.log*; export ASAN_OPTIONS="log_path=/tmp/asan.log"; /tmp/perf/perf mem record -a sleep 1; /tmp/perf/perf mem report'

I then go to the perf annotate view and quit. This triggers the asan error (from the log file):

```
==1254591==ERROR: AddressSanitizer: stack-use-after-return on address 0x7f2813331920 at pc 0x7f2818065991 bp 0x7fff0a21c750 sp 0x7fff0a21bf10
READ of size 80 at 0x7f2813331920 thread T0
    #0 0x7f2818065990 in __interceptor_strlen ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:461
    #1 0x7f2817698251 in SLsmg_write_wrapped_string (/lib/x86_64-linux-gnu/libslang.so.2+0x98251)
    #2 0x7f28176984b9 in SLsmg_write_nstring (/lib/x86_64-linux-gnu/libslang.so.2+0x984b9)
    #3 0x55c94045b365 in ui_browser__write_nstring ui/browser.c:60
    #4 0x55c94045c558 in __ui_browser__show_title ui/browser.c:266
    #5 0x55c94045c776 in ui_browser__show ui/browser.c:288
    #6 0x55c94045c06d in ui_browser__handle_resize ui/browser.c:206
    #7 0x55c94047979b in do_annotate ui/browsers/hists.c:2458
    #8 0x55c94047fb17 in evsel__hists_browse ui/browsers/hists.c:3412
    #9 0x55c940480a0c in perf_evsel_menu__run ui/browsers/hists.c:3527
    #10 0x55c940481108 in __evlist__tui_browse_hists ui/browsers/hists.c:3613
    #11 0x55c9404813f7 in evlist__tui_browse_hists ui/browsers/hists.c:3661
    #12 0x55c93ffa253f in report__browse_hists tools/perf/builtin-report.c:671
    #13 0x55c93ffa58ca in __cmd_report tools/perf/builtin-report.c:1141
    #14 0x55c93ffaf159 in cmd_report tools/perf/builtin-report.c:1805
    #15 0x55c94000c05c in report_events tools/perf/builtin-mem.c:374
    #16 0x55c94000d96d in cmd_mem tools/perf/builtin-mem.c:516
    #17 0x55c9400e44ee in run_builtin tools/perf/perf.c:350
    #18 0x55c9400e4a5a in handle_internal_command tools/perf/perf.c:403
    #19 0x55c9400e4e22 in run_argv tools/perf/perf.c:447
    #20 0x55c9400e53ad in main tools/perf/perf.c:561
    #21 0x7f28170456c9 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
    #22 0x7f2817045784 in __libc_start_main_impl ../csu/libc-start.c:360
    #23 0x55c93ff544c0 in _start (/tmp/perf/perf+0x19a4c0) (BuildId: 84899b0e8c7d3a3eaa67b2eb35e3d8b2f8cd4c93)

Address 0x7f2813331920 is located in stack of thread T0 at offset 32 in frame
    #0 0x55c94046e85e in hist_browser__run ui/browsers/hists.c:746

  This frame has 1 object(s):
    [32, 192) 'title' (line 747) <== Memory access at offset 32 is inside this variable
HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
```

hist_browser__run isn't on the stack so the asan error looks legit.

There's no clean init/exit on struct ui_browser so I may be trading a use-after-return for a memory leak, but that seems like a good trade anyway.

Fixes: 05e8b08 ("perf ui browser: Stop using 'self'")
Signed-off-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Athira Rajeev <[email protected]>
Cc: Ben Gainey <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kajol Jain <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Li Dong <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Oliver Upton <[email protected]>
Cc: Paran Lee <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ravi Bangoria <[email protected]>
Cc: Sun Haiyong <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Yanteng Si <[email protected]>
Cc: Yicong Yang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
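The lifetime problem described in this commit can be illustrated with a small user-space sketch. This is not the perf code: `demo_browser`, `demo_browser__show()`, `demo_browser__exit()` and `demo_run()` are hypothetical stand-ins, and the sketch only assumes that the browser keeps the title pointer after the caller's stack frame is gone.

```c
#define _POSIX_C_SOURCE 200809L /* for strdup() */
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct ui_browser; not the real perf type. */
struct demo_browser {
	char *title; /* owned heap copy, released in demo_browser__exit() */
};

/*
 * Store a private copy of the title so the browser never points into
 * the caller's stack frame. Returns 0 on success, -1 if strdup() fails.
 */
static int demo_browser__show(struct demo_browser *b, const char *title)
{
	char *copy = strdup(title);

	if (copy == NULL)
		return -1;

	free(b->title); /* drop the copy from a previous show, if any */
	b->title = copy;
	return 0;
}

/* Hypothetical cleanup hook; the commit notes that perf has no such exit. */
static void demo_browser__exit(struct demo_browser *b)
{
	free(b->title);
	b->title = NULL;
}

/* Mimics the caller: the title buffer lives on this function's stack. */
static int demo_run(struct demo_browser *b)
{
	char title[160];

	strcpy(title, "Samples: 42 of event 'cycles'");
	return demo_browser__show(b, title);
	/* 'title' dies on return, but b->title points at heap memory. */
}
```

Without the copy, `b->title` would dangle as soon as `demo_run()` returns, which is the pattern the ASan report flags. The price is that someone has to free the copy; in this sketch that is `demo_browser__exit()`, whereas the commit accepts a possible leak instead.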
Run the following BPF selftest on LoongArch:

  ./test_progs -t sockmap_listen

A kernel panic occurs:

'''
 Oops[#1]:
 CPU: 49 PID: 233429 Comm: new_name Tainted: G OE 6.10.0-rc2+ #20
 Hardware name: LOONGSON Dabieshan/Loongson-TC542F0, BIOS Loongson-UDK2018-V4.0.11
 pc 0000000000000000 ra 90000000051ea4a0 tp 900030008549c000 sp 900030008549fe00
 a0 9000300152524a00 a1 0000000000000000 a2 900030008549fe38 a3 900030008549fe30
 a4 900030008549fe30 a5 90003000c58c8d80 a6 0000000000000000 a7 0000000000000039
 t0 0000000000000000 t1 90003000c58c8d80 t2 0000000000000001 t3 0000000000000000
 t4 0000000000000001 t5 900000011a1bf580 t6 900000011a3aff60 t7 000000000000006b
 t8 00000fffffffffff u0 0000000000000000 s9 00007fffbbe9e930 s0 9000300152524a00
 s1 90003000c58c8d00 s2 9000000006c81568 s3 0000000000000000 s4 90003000c58c8d80
 s5 00007ffff236a000 s6 00007ffffbc292b0 s7 00007ffffbc29998 s8 00007fffbbe9f180
 ra: 90000000051ea4a0 inet_release+0x60/0xc0
 ERA: 0000000000000000 0x0
 CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
 PRMD: 0000000c (PPLV0 +PIE +PWE)
 EUEN: 00000000 (-FPE -SXE -ASXE -BTE)
 ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7)
 ESTAT: 00030000 [PIF] (IS= ECode=3 EsubCode=0)
 BADV: 0000000000000000
 PRID: 0014c011 (Loongson-64bit, Loongson-3C5000)
 Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp
 Process new_name (pid: 233429, threadinfo=00000000b9196405, task=00000000c01df45b)
 Stack : 0000000000000000 90003000c58c8e20 90003000c58c8d00 900000000505960c
         0000000000000000 9000000101c6ad20 9000300086524540 00000000082e0003
         900030008bf57400 90000000050596bc 900030008bf57400 900000000434acac
         0000000000000016 00007ffff224e060 00007fffbbe9f180 900030008bf57400
         0000000000000000 9000000004341ce0 00007fffbbe9f180 00007ffff2369000
         900030008549fec0 90000000054476ec 000000000000006b 9000000003f71da4
         000000000000003a 00007ffff22b8a44 00007fffbbe9f8e0 00007fffbbe9e680
         ffffffffffffffda 0000000000000000 0000000000000000 0000000000000000
         00007fffbbe9f288 0000000000000000 0000000000000000 0000000000000039
         84c2431493ceab6e 84c23ceb2827425e 0000000000000007 00007ffff2271600
         ...
 Call Trace:
 [<900000000505960c>] __sock_release+0x4c/0xe0
 [<90000000050596bc>] sock_close+0x1c/0x40
 [<900000000434acac>] __fput+0xec/0x2e0
 [<9000000004341ce0>] sys_close+0x40/0xa0
 [<90000000054476ec>] do_syscall+0x8c/0xc0
 [<9000000003f71da4>] handle_syscall+0xc4/0x160
 Code: (Bad address in era)
 ---[ end trace 0000000000000000 ]---
 Kernel panic - not syncing: Fatal exception
 Kernel relocated by 0x3d50000
 .text @ 0x9000000003f50000
 .data @ 0x90000000055b0000
 .bss @ 0x9000000006ca9400
 ---[ end Kernel panic - not syncing: Fatal exception ]---
'''

This is because the "sk->sk_prot->close" pointer is NULL in that case. This patch adds a NULL check for it in inet_release() to fix this error.

Fixes: 1da177e ("Linux-2.6.12-rc2")
Signed-off-by: Geliang Tang <[email protected]>
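The kind of guard this commit message describes can be sketched in user space as follows. The types and functions here (`demo_proto`, `demo_sock`, `demo_release()`) are hypothetical, heavily reduced stand-ins rather than the kernel's struct proto, struct sock or inet_release(); the sketch only shows the pattern of checking a callback pointer before calling through it.

```c
#include <stddef.h>

/* Hypothetical stand-in for the protocol ops table. */
struct demo_proto {
	void (*close)(void *sk, long timeout); /* may legitimately be NULL */
};

/* Hypothetical stand-in for the socket. */
struct demo_sock {
	const struct demo_proto *prot;
	int closed; /* set by the close callback, for demonstration only */
};

static void demo_close(void *sk, long timeout)
{
	(void)timeout;
	((struct demo_sock *)sk)->closed = 1;
}

/*
 * Release path with the guard: only call ->close when it is set.
 * Returns 1 if the callback ran, 0 if it was skipped.
 */
static int demo_release(struct demo_sock *sk, long timeout)
{
	if (sk->prot == NULL || sk->prot->close == NULL)
		return 0; /* calling through NULL would jump to address 0 */

	sk->prot->close(sk, timeout);
	return 1;
}
```

The unguarded call matches the trace above, where pc and ERA are both 0 with inet_release() as the return address.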
In TRACE_EVENT(qdisc_reset), a NULL dereference occurred at

 qdisc->dev_queue->dev <NULL> ->name

This situation was reproduced with a bunch of veths and Bluetooth disconnection and reconnection.

During qdisc initialization, the qdisc was set to noop_queue. In veth_init_queues(), the initial tx_num was reduced back to one, causing the qdisc reset to be called with noop, which led to the kernel panic.

I've attached a GitHub gist link with the syz-execprog program converted to C source code and 3 logs of the reproduced vmcore-dmesg.

 https://gist.github.com/yskelg/cc64562873ce249cdd0d5a358b77d740

Yeoreum and I use two fuzzing tools simultaneously.

One process runs syz-executor: https://github.com/google/syzkaller

 $ ./syz-execprog -executor=./syz-executor -repeat=1 -sandbox=setuid \
   -enable=none -collide=false log1

The other process runs the perf fuzzer: https://github.com/deater/perf_event_tests/tree/master/fuzzer

 $ perf_event_tests/fuzzer/perf_fuzzer

I think this will happen on the following kernel versions: Linux kernel +v6.7.10, +v6.8, +v6.9, and it could happen in v6.10.

This has occurred since 51270d5. I think this patch is absolutely necessary. Previously, it showed an unintended string value for the name.

I've reproduced it 3 times on my Fedora 40 debug kernel with no other module or patch applied.

version: 6.10.0-0.rc2.20240608gitdc772f8237f9.29.fc41.aarch64+debug [ 5287.164555] veth0_vlan: left promiscuous mode [ 5287.164929] veth1_macvtap: left promiscuous mode [ 5287.164950] veth0_macvtap: left promiscuous mode [ 5287.164983] veth1_vlan: left promiscuous mode [ 5287.165008] veth0_vlan: left promiscuous mode [ 5287.165450] veth1_macvtap: left promiscuous mode [ 5287.165472] veth0_macvtap: left promiscuous mode [ 5287.165502] veth1_vlan: left promiscuous mode … [ 5297.598240] bridge0: port 2(bridge_slave_1) entered blocking state [ 5297.598262] bridge0: port 2(bridge_slave_1) entered forwarding state [ 5297.598296] bridge0: port 1(bridge_slave_0) entered blocking state [ 5297.598313] bridge0: port 1(bridge_slave_0) entered forwarding state [ 5297.616090] 8021q: adding VLAN 0 to HW filter on device bond0 [ 5297.620405] bridge0: port 1(bridge_slave_0) entered disabled state [ 5297.620730] bridge0: port 2(bridge_slave_1) entered disabled state [ 5297.627247] 8021q: adding VLAN 0 to HW filter on device team0 [ 5297.629636] bridge0: port 1(bridge_slave_0) entered blocking state … [ 5298.002798] bridge_slave_0: left promiscuous mode [ 5298.002869] bridge0: port 1(bridge_slave_0) entered disabled state [ 5298.309444] bond0 (unregistering): (slave bond_slave_0): Releasing backup interface [ 5298.315206] bond0 (unregistering): (slave bond_slave_1): Releasing backup interface [ 5298.320207] bond0 (unregistering): Released all slaves [ 5298.354296] hsr_slave_0: left promiscuous mode [ 5298.360750] hsr_slave_1: left promiscuous mode [ 5298.374889] veth1_macvtap: left promiscuous mode [ 5298.374931] veth0_macvtap: left promiscuous mode [ 5298.374988] veth1_vlan: left promiscuous mode [ 5298.375024] veth0_vlan: left promiscuous mode [ 5299.109741] team0 (unregistering): Port device team_slave_1 removed [ 5299.185870] team0 (unregistering): Port device team_slave_0 removed … [ 5300.155443] Bluetooth: hci3: unexpected cc 0x0c03 length: 249 > 1 [ 5300.155724] Bluetooth: 
hci3: unexpected cc 0x1003 length: 249 > 9 [ 5300.155988] Bluetooth: hci3: unexpected cc 0x1001 length: 249 > 9 …. [ 5301.075531] team0: Port device team_slave_1 added [ 5301.085515] bridge0: port 1(bridge_slave_0) entered blocking state [ 5301.085531] bridge0: port 1(bridge_slave_0) entered disabled state [ 5301.085588] bridge_slave_0: entered allmulticast mode [ 5301.085800] bridge_slave_0: entered promiscuous mode [ 5301.095617] bridge0: port 1(bridge_slave_0) entered blocking state [ 5301.095633] bridge0: port 1(bridge_slave_0) entered disabled state … [ 5301.149734] bond0: (slave bond_slave_0): Enslaving as an active interface with an up link [ 5301.173234] bond0: (slave bond_slave_0): Enslaving as an active interface with an up link [ 5301.180517] bond0: (slave bond_slave_1): Enslaving as an active interface with an up link [ 5301.193481] hsr_slave_0: entered promiscuous mode [ 5301.204425] hsr_slave_1: entered promiscuous mode [ 5301.210172] debugfs: Directory 'hsr0' with parent 'hsr' already present! [ 5301.210185] Cannot create hsr debugfs directory [ 5301.224061] bond0: (slave bond_slave_1): Enslaving as an active interface with an up link [ 5301.246901] bond0: (slave bond_slave_0): Enslaving as an active interface with an up link [ 5301.255934] team0: Port device team_slave_0 added [ 5301.256480] team0: Port device team_slave_1 added [ 5301.256948] team0: Port device team_slave_0 added … [ 5301.435928] hsr_slave_0: entered promiscuous mode [ 5301.446029] hsr_slave_1: entered promiscuous mode [ 5301.455872] debugfs: Directory 'hsr0' with parent 'hsr' already present! [ 5301.455884] Cannot create hsr debugfs directory [ 5301.502664] hsr_slave_0: entered promiscuous mode [ 5301.513675] hsr_slave_1: entered promiscuous mode [ 5301.526155] debugfs: Directory 'hsr0' with parent 'hsr' already present! 
[ 5301.526164] Cannot create hsr debugfs directory [ 5301.563662] hsr_slave_0: entered promiscuous mode [ 5301.576129] hsr_slave_1: entered promiscuous mode [ 5301.580259] debugfs: Directory 'hsr0' with parent 'hsr' already present! [ 5301.580270] Cannot create hsr debugfs directory [ 5301.590269] 8021q: adding VLAN 0 to HW filter on device bond0 [ 5301.595872] KASAN: null-ptr-deref in range [0x0000000000000130-0x0000000000000137] [ 5301.595877] Mem abort info: [ 5301.595881] ESR = 0x0000000096000006 [ 5301.595885] EC = 0x25: DABT (current EL), IL = 32 bits [ 5301.595889] SET = 0, FnV = 0 [ 5301.595893] EA = 0, S1PTW = 0 [ 5301.595896] FSC = 0x06: level 2 translation fault [ 5301.595900] Data abort info: [ 5301.595903] ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000 [ 5301.595907] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 5301.595911] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 5301.595915] [dfff800000000026] address between user and kernel address ranges [ 5301.595971] Internal error: Oops: 0000000096000006 [#1] SMP … [ 5301.596076] CPU: 2 PID: 102769 Comm: syz-executor.3 Kdump: loaded Tainted: G W ------- --- 6.10.0-0.rc2.20240608gitdc772f8237f9.29.fc41.aarch64+debug #1 [ 5301.596080] Hardware name: VMware, Inc. 
VMware20,1/VBSA, BIOS VMW201.00V.21805430.BA64.2305221830 05/22/2023 [ 5301.596082] pstate: 01400005 (nzcv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) [ 5301.596085] pc : strnlen+0x40/0x88 [ 5301.596114] lr : trace_event_get_offsets_qdisc_reset+0x6c/0x2b0 [ 5301.596124] sp : ffff8000beef6b40 [ 5301.596126] x29: ffff8000beef6b40 x28: dfff800000000000 x27: 0000000000000001 [ 5301.596131] x26: 6de1800082c62bd0 x25: 1ffff000110aa9e0 x24: ffff800088554f00 [ 5301.596136] x23: ffff800088554ec0 x22: 0000000000000130 x21: 0000000000000140 [ 5301.596140] x20: dfff800000000000 x19: ffff8000beef6c60 x18: ffff7000115106d8 [ 5301.596143] x17: ffff800121bad000 x16: ffff800080020000 x15: 0000000000000006 [ 5301.596147] x14: 0000000000000002 x13: ffff0001f3ed8d14 x12: ffff700017ddeda5 [ 5301.596151] x11: 1ffff00017ddeda4 x10: ffff700017ddeda4 x9 : ffff800082cc5eec [ 5301.596155] x8 : 0000000000000004 x7 : 00000000f1f1f1f1 x6 : 00000000f2f2f200 [ 5301.596158] x5 : 00000000f3f3f3f3 x4 : ffff700017dded80 x3 : 00000000f204f1f1 [ 5301.596162] x2 : 0000000000000026 x1 : 0000000000000000 x0 : 0000000000000130 [ 5301.596166] Call trace: [ 5301.596175] strnlen+0x40/0x88 [ 5301.596179] trace_event_get_offsets_qdisc_reset+0x6c/0x2b0 [ 5301.596182] perf_trace_qdisc_reset+0xb0/0x538 [ 5301.596184] __traceiter_qdisc_reset+0x68/0xc0 [ 5301.596188] qdisc_reset+0x43c/0x5e8 [ 5301.596190] netif_set_real_num_tx_queues+0x288/0x770 [ 5301.596194] veth_init_queues+0xfc/0x130 [veth] [ 5301.596198] veth_newlink+0x45c/0x850 [veth] [ 5301.596202] rtnl_newlink_create+0x2c8/0x798 [ 5301.596205] __rtnl_newlink+0x92c/0xb60 [ 5301.596208] rtnl_newlink+0xd8/0x130 [ 5301.596211] rtnetlink_rcv_msg+0x2e0/0x890 [ 5301.596214] netlink_rcv_skb+0x1c4/0x380 [ 5301.596225] rtnetlink_rcv+0x20/0x38 [ 5301.596227] netlink_unicast+0x3c8/0x640 [ 5301.596231] netlink_sendmsg+0x658/0xa60 [ 5301.596234] __sock_sendmsg+0xd0/0x180 [ 5301.596243] __sys_sendto+0x1c0/0x280 [ 5301.596246] __arm64_sys_sendto+0xc8/0x150 [ 5301.596249] 
invoke_syscall+0xdc/0x268
[ 5301.596256] el0_svc_common.constprop.0+0x16c/0x240
[ 5301.596259] do_el0_svc+0x48/0x68
[ 5301.596261] el0_svc+0x50/0x188
[ 5301.596265] el0t_64_sync_handler+0x120/0x130
[ 5301.596268] el0t_64_sync+0x194/0x198
[ 5301.596272] Code: eb15001f 54000120 d343fc02 12000801 (38f46842)
[ 5301.596285] SMP: stopping secondary CPUs
[ 5301.597053] Starting crashdump kernel...
[ 5301.597057] Bye!

After applying our patch, I didn't find any kernel panic errors.

We've found a simple reproducer:

 # echo 1 > /sys/kernel/debug/tracing/events/qdisc/qdisc_reset/enable
 # ip link add veth0 type veth peer name veth1
 Error: Unknown device type.

However, without our patch applied, I tested the upstream 6.10.0-rc3 kernel using the qdisc_reset event and the ip command on my qemu virtual machine. These 2 commands always cause a kernel panic.

Linux version: 6.10.0-rc3
[ 0.000000] Linux version 6.10.0-rc3-00164-g44ef20baed8e-dirty (paran@fedora) (gcc (GCC) 14.1.1 20240522 (Red Hat 14.1.1-4), GNU ld version 2.41-34.fc40) #20 SMP PREEMPT Sat Jun 15 16:51:25 KST 2024

Kernel panic message:
[ 615.236484] Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
[ 615.237250] Dumping ftrace buffer:
[ 615.237679] (ftrace buffer empty)
[ 615.238097] Modules linked in: veth crct10dif_ce virtio_gpu virtio_dma_buf drm_shmem_helper drm_kms_helper zynqmp_fpga xilinx_can xilinx_spi xilinx_selectmap xilinx_core xilinx_pr_decoupler versal_fpga uvcvideo uvc videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videodev videobuf2_common mc usbnet deflate zstd ubifs ubi rcar_canfd rcar_can omap_mailbox ntb_msi_test ntb_hw_epf lattice_sysconfig_spi lattice_sysconfig ice40_spi gpio_xilinx dwmac_altr_socfpga mdio_regmap stmmac_platform stmmac pcs_xpcs dfl_fme_region dfl_fme_mgr dfl_fme_br dfl_afu dfl fpga_region fpga_bridge can can_dev br_netfilter bridge stp llc atl1c ath11k_pci mhi ath11k_ahb ath11k qmi_helpers ath10k_sdio ath10k_pci ath10k_core ath mac80211 libarc4 cfg80211 
drm fuse backlight ipv6 Jun 22 02:36:5[3 6k152.62-4sm98k4-0k]v kCePUr:n e1l :P IUDn:a b4le6 8t oC ohmma: nidpl eN oketr nteali nptaedg i6n.g1 0re.0q-urecs3t- 0at0 1v6i4r-tgu4a4le fa2d0dbraeeds0se-dir tyd f#f2f08 615.252376] Hardware name: linux,dummy-virt (DT) [ 615.253220] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 615.254433] pc : strnlen+0x6c/0xe0 [ 615.255096] lr : trace_event_get_offsets_qdisc_reset+0x94/0x3d0 [ 615.256088] sp : ffff800080b269a0 [ 615.256615] x29: ffff800080b269a0 x28: ffffc070f3f98500 x27: 0000000000000001 [ 615.257831] x26: 0000000000000010 x25: ffffc070f3f98540 x24: ffffc070f619cf60 [ 615.259020] x23: 0000000000000128 x22: 0000000000000138 x21: dfff800000000000 [ 615.260241] x20: ffffc070f631ad00 x19: 0000000000000128 x18: ffffc070f448b800 [ 615.261454] x17: 0000000000000000 x16: 0000000000000001 x15: ffffc070f4ba2a90 [ 615.262635] x14: ffff700010164d73 x13: 1ffff80e1e8d5eb3 x12: 1ffff00010164d72 [ 615.263877] x11: ffff700010164d72 x10: dfff800000000000 x9 : ffffc070e85d6184 [ 615.265047] x8 : ffffc070e4402070 x7 : 000000000000f1f1 x6 : 000000001504a6d3 [ 615.266336] x5 : ffff28ca21122140 x4 : ffffc070f5043ea8 x3 : 0000000000000000 [ 615.267528] x2 : 0000000000000025 x1 : 0000000000000000 x0 : 0000000000000000 [ 615.268747] Call trace: [ 615.269180] strnlen+0x6c/0xe0 [ 615.269767] trace_event_get_offsets_qdisc_reset+0x94/0x3d0 [ 615.270716] trace_event_raw_event_qdisc_reset+0xe8/0x4e8 [ 615.271667] __traceiter_qdisc_reset+0xa0/0x140 [ 615.272499] qdisc_reset+0x554/0x848 [ 615.273134] netif_set_real_num_tx_queues+0x360/0x9a8 [ 615.274050] veth_init_queues+0x110/0x220 [veth] [ 615.275110] veth_newlink+0x538/0xa50 [veth] [ 615.276172] __rtnl_newlink+0x11e4/0x1bc8 [ 615.276944] rtnl_newlink+0xac/0x120 [ 615.277657] rtnetlink_rcv_msg+0x4e4/0x1370 [ 615.278409] netlink_rcv_skb+0x25c/0x4f0 [ 615.279122] rtnetlink_rcv+0x48/0x70 [ 615.279769] netlink_unicast+0x5a8/0x7b8 [ 615.280462] netlink_sendmsg+0xa70/0x1190 Yeoreum and I 
don't know whether the patch we wrote fixes the underlying cause, but we think the priority is to prevent the kernel panic from happening. So, we're sending this patch.

Fixes: 51270d5 ("tracing/net_sched: Fix tracepoints that save qdisc_dev() as a string")
Link: https://lore.kernel.org/lkml/[email protected]/t/
Cc: [email protected]
Tested-by: Yunseong Kim <[email protected]>
Signed-off-by: Yunseong Kim <[email protected]>
Signed-off-by: Yeoreum Yun <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
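As an illustration of the failure mode only, and not the contents of this patch, a defensive name lookup along the dereference chain from the commit message might look like the sketch below. The `demo_*` types and the "(null)" fallback string are assumptions made for this example.

```c
#include <stddef.h>

/* Hypothetical, reduced stand-ins for the kernel structures involved. */
struct demo_net_device {
	char name[16];
};

struct demo_netdev_queue {
	struct demo_net_device *dev; /* NULL for the noop qdisc's queue */
};

struct demo_qdisc {
	struct demo_netdev_queue *dev_queue;
};

/*
 * Walk qdisc->dev_queue->dev->name, falling back to a placeholder
 * string when any link in the chain is NULL.
 */
static const char *demo_qdisc_dev_name(const struct demo_qdisc *q)
{
	if (q == NULL || q->dev_queue == NULL || q->dev_queue->dev == NULL)
		return "(null)";

	return q->dev_queue->dev->name;
}
```

A tracepoint that records the name through such a helper would log a placeholder for the noop qdisc instead of handing a near-NULL pointer to strnlen(), which is where both traces above fault.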
The code in ocfs2_dio_end_io_write() estimates the number of necessary transaction credits using ocfs2_calc_extend_credits(). This however does not take into account that the IO could be arbitrarily large and can contain an arbitrary number of extents.

Extent tree manipulations do often extend the current transaction, but not in all cases. For example, if we have only single block extents in the tree, ocfs2_mark_extent_written() will end up calling ocfs2_replace_extent_rec() all the time and we will never extend the current transaction, eventually exhausting all the transaction credits if the IO contains many single block extents. Once that happens, a WARN_ON(jbd2_handle_buffer_credits(handle) <= 0) is triggered in jbd2_journal_dirty_metadata() and subsequently OCFS2 aborts in response to this error. This was actually triggered by one of our customers on a heavily fragmented OCFS2 filesystem.

To fix the issue, make sure the transaction always has enough credits for one extent insert before each call of ocfs2_mark_extent_written().
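The difference between a single up-front estimate and a per-extent top-up can be modelled with a toy credit counter. This is a sketch under stated assumptions, not OCFS2 or jbd2 code: `demo_handle`, `demo_ensure_credits()` and `demo_mark_extents()` are hypothetical names, and every extent is assumed to consume a fixed number of credits.

```c
/* Hypothetical model of a journal handle: just a credit counter. */
struct demo_handle {
	int credits;    /* buffer credits left in the running transaction */
	int extensions; /* how many times the handle had to be topped up */
};

/* Top the handle up so that at least 'needed' credits are available. */
static void demo_ensure_credits(struct demo_handle *h, int needed)
{
	if (h->credits < needed) {
		h->credits = needed;
		h->extensions++;
	}
}

/*
 * Mark 'nr_extents' extents written, each consuming 'per_extent' credits.
 * With 'ensure_each_time' set, credits are checked before every extent.
 * Returns 0 on success, -1 if the handle would run out of credits.
 */
static int demo_mark_extents(struct demo_handle *h, int nr_extents,
			     int per_extent, int ensure_each_time)
{
	for (int i = 0; i < nr_extents; i++) {
		if (ensure_each_time)
			demo_ensure_credits(h, per_extent);
		if (h->credits < per_extent)
			return -1; /* models the WARN_ON and abort path */
		h->credits -= per_extent;
	}
	return 0;
}
```

With 8 initial credits and 2 credits per extent, the up-front variant fails on the fifth extent, while the per-extent variant completes any number of extents at the cost of extra top-ups.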
Heming Zhao said:
------
PANIC: "Kernel panic - not syncing: OCFS2: (device dm-1): panic forced after error"
PID: xxx TASK: xxxx CPU: 5 COMMAND: "SubmitThread-CA"
#0 machine_kexec at ffffffff8c069932
#1 __crash_kexec at ffffffff8c1338fa
#2 panic at ffffffff8c1d69b9
#3 ocfs2_handle_error at ffffffffc0c86c0c [ocfs2]
#4 __ocfs2_abort at ffffffffc0c88387 [ocfs2]
#5 ocfs2_journal_dirty at ffffffffc0c51e98 [ocfs2]
#6 ocfs2_split_extent at ffffffffc0c27ea3 [ocfs2]
#7 ocfs2_change_extent_flag at ffffffffc0c28053 [ocfs2]
#8 ocfs2_mark_extent_written at ffffffffc0c28347 [ocfs2]
#9 ocfs2_dio_end_io_write at ffffffffc0c2bef9 [ocfs2]
#10 ocfs2_dio_end_io at ffffffffc0c2c0f5 [ocfs2]
#11 dio_complete at ffffffff8c2b9fa7
#12 do_blockdev_direct_IO at ffffffff8c2bc09f
#13 ocfs2_direct_IO at ffffffffc0c2b653 [ocfs2]
#14 generic_file_direct_write at ffffffff8c1dcf14
#15 __generic_file_write_iter at ffffffff8c1dd07b
#16 ocfs2_file_write_iter at ffffffffc0c49f1f [ocfs2]
#17 aio_write at ffffffff8c2cc72e
#18 kmem_cache_alloc at ffffffff8c248dde
#19 do_io_submit at ffffffff8c2ccada
#20 do_syscall_64 at ffffffff8c004984
#21 entry_SYSCALL_64_after_hwframe at ffffffff8c8000ba

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: c15471f ("ocfs2: fix sparse file & data ordering issue in direct io")
Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Joseph Qi <[email protected]>
Reviewed-by: Heming Zhao <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
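The credit exhaustion described in the OCFS2 commit above can be pictured with a toy model. The sketch below is not the OCFS2 or jbd2 API: `toy_handle`, `toy_ensure_credits()` and `toy_mark_extents_written()` are hypothetical names, and journal credits are modeled as a plain counter with a fixed per-extent cost. It only illustrates why topping up the handle before each extent keeps a long run of single-block extents from running out of credits.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a journal handle; not the jbd2 API. */
struct toy_handle {
    int credits;     /* buffer credits still available */
    int extensions;  /* how often the handle had to be topped up */
};

/* Make sure at least `needed` credits are available, topping up if not. */
static void toy_ensure_credits(struct toy_handle *h, int needed)
{
    assert(needed > 0);
    if (h->credits < needed) {
        h->credits = needed;
        h->extensions++;
    }
}

/*
 * Mark `n` extents as written, each consuming `cost` credits.
 * With `ensure` set, credits are checked before every extent (the fixed
 * behaviour); without it, only the initial estimate is used.
 * Returns false when the handle runs dry, which models the WARN_ON and
 * the subsequent abort.
 */
static bool toy_mark_extents_written(struct toy_handle *h, int n, int cost,
                                     bool ensure)
{
    for (int i = 0; i < n; i++) {
        if (ensure)
            toy_ensure_credits(h, cost);
        if (h->credits < cost)
            return false;
        h->credits -= cost;
    }
    return true;
}
```

With an initial estimate of 10 credits and 100 extents costing 3 credits each, the unchecked loop fails on the fourth extent, while the checked loop completes.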
There is a race condition in the CMT interrupt handler. In the interrupt handler the driver sets a driver private flag, FLAG_IRQCONTEXT. This flag is used to indicate that any call to set_next_event() should not be directly propagated to the device, but instead cached. This is done as the interrupt handler itself reprograms the device when needed before it completes, which avoids performing this operation twice. It is unclear why this design was chosen; my suspicion is that it allows the struct clock_event_device.event_handler callback, which is called while FLAG_IRQCONTEXT is set, to update the next event without having to write to the device twice. Unfortunately there is a race between when the FLAG_IRQCONTEXT flag is set and later cleared, in which the interrupt handler has already started to write the next event to the device. If set_next_event() is called in this window the value is only cached in the driver but not written. This causes the board to misbehave or, worse, lock up and produce a splat. rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: rcu: 0-...!: (0 ticks this GP) idle=f5e0/0/0x0 softirq=519/519 fqs=0 (false positive?) 
rcu: (detected by 1, t=6502 jiffies, g=-595, q=77 ncpus=2) Sending NMI from CPU 1 to CPUs 0: NMI backtrace for cpu 0 CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.10.0-rc5-arm64-renesas-00019-g74a6f86eaf1c-dirty #20 Hardware name: Renesas Salvator-X 2nd version board based on r8a77965 (DT) pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : tick_check_broadcast_expired+0xc/0x40 lr : cpu_idle_poll.isra.0+0x8c/0x168 sp : ffff800081c63d70 x29: ffff800081c63d70 x28: 00000000580000c8 x27: 00000000bfee5610 x26: 0000000000000027 x25: 0000000000000000 x24: 0000000000000000 x23: ffff00007fbb9100 x22: ffff8000818f1008 x21: ffff8000800ef07c x20: ffff800081c79ec0 x19: ffff800081c70c28 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffffc2c717d8 x14: 0000000000000000 x13: ffff000009c18080 x12: ffff8000825f7fc0 x11: 0000000000000000 x10: ffff8000818f3cd4 x9 : 0000000000000028 x8 : ffff800081c79ec0 x7 : ffff800081c73000 x6 : 0000000000000000 x5 : 0000000000000000 x4 : ffff7ffffe286000 x3 : 0000000000000000 x2 : ffff7ffffe286000 x1 : ffff800082972900 x0 : ffff8000818f1008 Call trace: tick_check_broadcast_expired+0xc/0x40 do_idle+0x9c/0x280 cpu_startup_entry+0x34/0x40 kernel_init+0x0/0x11c do_one_initcall+0x0/0x260 __primary_switched+0x80/0x88 rcu: rcu_preempt kthread timer wakeup didn't happen for 6501 jiffies! g-595 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 rcu: Possible timer handling issue on cpu=0 timer-softirq=262 rcu: rcu_preempt kthread starved for 6502 jiffies! g-595 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0 rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior. 
rcu: RCU grace-period kthread stack dump: task:rcu_preempt state:I stack:0 pid:15 tgid:15 ppid:2 flags:0x00000008 Call trace: __switch_to+0xbc/0x100 __schedule+0x358/0xbe0 schedule+0x48/0x148 schedule_timeout+0xc4/0x138 rcu_gp_fqs_loop+0x12c/0x764 rcu_gp_kthread+0x208/0x298 kthread+0x10c/0x110 ret_from_fork+0x10/0x20 The design has been part of the driver since it was first merged in early 2009. The issue becomes increasingly harder to trigger the older the kernel version one tries. It only takes a few boots on v6.10-rc5, while hundreds of boots are needed to trigger it on v5.10. Close the race condition by using the CMT channel lock for the two competing sections. The channel lock was added to the driver after its initial design. Signed-off-by: Niklas Söderlund <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Daniel Lezcano <[email protected]>
The RISC-V kernel already has checks to ensure that memory which would lie outside of the linear mapping is not used. However those checks use memory_limit, which is used to implement the mem= kernel command line option (to limit the total amount of memory, not its address range). When memory is made up of two or more non-contiguous memory banks this check is incorrect.

Two changes are made here:
- add a call in setup_bootmem() to memblock_cap_memory_range() which will cause any memory which falls outside the linear mapping to be removed from the memory regions.
- remove the check in create_linear_mapping_page_table() which was intended to remove memory which is outside the linear mapping based on memory_limit, as it is no longer needed. Note a check for mapping more memory than memory_limit (to implement mem=) is unnecessary because of the existing call to memblock_enforce_memory_limit().

This issue was seen when booting on a SV39 platform with two memory banks:
0x00,80000000 1GiB
0x20,00000000 32GiB
This memory range is 158GiB from top to bottom, but the linear mapping is limited to 128GiB, so the lower block of RAM will be mapped at PAGE_OFFSET, and the upper block straddles the top of the linear mapping. This causes the following Oops:
[ 0.000000] Linux version 6.10.0-rc2-gd3b8dd5b51dd-dirty ([email protected]) (riscv64-codasip-linux-gcc (GCC) 13.2.0, GNU ld (GNU Binutils) 2.41.0.20231213) #20 SMP Sat Jun 22 11:34:22 BST 2024
[ 0.000000] memblock_add: [0x0000000080000000-0x00000000bfffffff] early_init_dt_add_memory_arch+0x4a/0x52
[ 0.000000] memblock_add: [0x0000002000000000-0x00000027ffffffff] early_init_dt_add_memory_arch+0x4a/0x52
...
[ 0.000000] memblock_alloc_try_nid: 23724 bytes align=0x8 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 early_init_dt_alloc_memory_arch+0x1e/0x48 [ 0.000000] memblock_reserve: [0x00000027ffff5350-0x00000027ffffaffb] memblock_alloc_range_nid+0xb8/0x132 [ 0.000000] Unable to handle kernel paging request at virtual address fffffffe7fff5350 [ 0.000000] Oops [#1] [ 0.000000] Modules linked in: [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.10.0-rc2-gd3b8dd5b51dd-dirty #20 [ 0.000000] Hardware name: codasip,a70x (DT) [ 0.000000] epc : __memset+0x8c/0x104 [ 0.000000] ra : memblock_alloc_try_nid+0x74/0x84 [ 0.000000] epc : ffffffff805e88c8 ra : ffffffff806148f6 sp : ffffffff80e03d50 [ 0.000000] gp : ffffffff80ec4158 tp : ffffffff80e0bec0 t0 : fffffffe7fff52f8 [ 0.000000] t1 : 00000027ffffb000 t2 : 5f6b636f6c626d65 s0 : ffffffff80e03d90 [ 0.000000] s1 : 0000000000005cac a0 : fffffffe7fff5350 a1 : 0000000000000000 [ 0.000000] a2 : 0000000000005cac a3 : fffffffe7fffaff8 a4 : 000000000000002c [ 0.000000] a5 : ffffffff805e88c8 a6 : 0000000000005cac a7 : 0000000000000030 [ 0.000000] s2 : fffffffe7fff5350 s3 : ffffffffffffffff s4 : 0000000000000000 [ 0.000000] s5 : ffffffff8062347e s6 : 0000000000000000 s7 : 0000000000000001 [ 0.000000] s8 : 0000000000002000 s9 : 00000000800226d0 s10: 0000000000000000 [ 0.000000] s11: 0000000000000000 t3 : ffffffff8080a928 t4 : ffffffff8080a928 [ 0.000000] t5 : ffffffff8080a928 t6 : ffffffff8080a940 [ 0.000000] status: 0000000200000100 badaddr: fffffffe7fff5350 cause: 000000000000000f [ 0.000000] [<ffffffff805e88c8>] __memset+0x8c/0x104 [ 0.000000] [<ffffffff8062349c>] early_init_dt_alloc_memory_arch+0x1e/0x48 [ 0.000000] [<ffffffff8043e892>] __unflatten_device_tree+0x52/0x114 [ 0.000000] [<ffffffff8062441e>] unflatten_device_tree+0x9e/0xb8 [ 0.000000] [<ffffffff806046fe>] setup_arch+0xd4/0x5bc [ 0.000000] [<ffffffff806007aa>] start_kernel+0x76/0x81a [ 0.000000] Code: b823 02b2 bc23 02b2 b023 04b2 b423 04b2 b823 04b2 (bc23) 
04b2 [ 0.000000] ---[ end trace 0000000000000000 ]--- [ 0.000000] Kernel panic - not syncing: Attempted to kill the idle task! [ 0.000000] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]--- The problem is that memblock (unaware that some physical memory cannot be used) has allocated memory from the top of memory but which is outside the linear mapping region. Signed-off-by: Stuart Menefy <[email protected]> Fixes: c99127c ("riscv: Make sure the linear mapping does not use the kernel mapping") Reviewed-by: David McKay <[email protected]> Reviewed-by: Alexandre Ghiti <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Palmer Dabbelt <[email protected]>
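The arithmetic behind the RISC-V report above can be checked with a small sketch. The code below is not memblock: `toy_region` and `toy_cap_memory_range()` are hypothetical names, and the assumption (taken from the commit text) is that the linear mapping starts at the lowest bank, physical 2 GiB, and spans 128 GiB. The helper clips every region to that window and drops whatever falls entirely outside it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GiB (1ULL << 30)

/* Hypothetical memory bank descriptor; not struct memblock_region. */
struct toy_region {
    uint64_t base;
    uint64_t size;
};

/*
 * Clip all regions to the window [base, base + size) in place and drop
 * regions that lie completely outside it. Returns the new region count.
 */
static size_t toy_cap_memory_range(struct toy_region *r, size_t n,
                                   uint64_t base, uint64_t size)
{
    uint64_t end = base + size;
    size_t out = 0;

    assert(size > 0);
    for (size_t i = 0; i < n; i++) {
        uint64_t lo = r[i].base;
        uint64_t hi = r[i].base + r[i].size;

        if (lo < base)
            lo = base;
        if (hi > end)
            hi = end;
        if (lo < hi) {
            r[out].base = lo;
            r[out].size = hi - lo;
            out++;
        }
    }
    return out;
}
```

For the two banks in the report (1 GiB at 2 GiB and 32 GiB at 128 GiB), the window ends at 130 GiB, so in this model the lower bank is untouched and only the first 2 GiB of the upper bank remain usable. The failing allocation near 0x27ffff5350 sits well above that limit.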
iter_finish_branch_entry() doesn't put the branch_info from/to map elements creating memory leaks. This can be seen with: ``` $ perf record -e cycles -b perf test -w noploop $ perf report -D ... Direct leak of 984344 byte(s) in 123043 object(s) allocated from: #0 0x7fb2654f3bd7 in malloc libsanitizer/asan/asan_malloc_linux.cpp:69 #1 0x564d3400d10b in map__get util/map.h:186 #2 0x564d3400d10b in ip__resolve_ams util/machine.c:1981 #3 0x564d34014d81 in sample__resolve_bstack util/machine.c:2151 #4 0x564d34094790 in iter_prepare_branch_entry util/hist.c:898 #5 0x564d34098fa4 in hist_entry_iter__add util/hist.c:1238 #6 0x564d33d1f0c7 in process_sample_event tools/perf/builtin-report.c:334 #7 0x564d34031eb7 in perf_session__deliver_event util/session.c:1655 #8 0x564d3403ba52 in do_flush util/ordered-events.c:245 #9 0x564d3403ba52 in __ordered_events__flush util/ordered-events.c:324 #10 0x564d3402d32e in perf_session__process_user_event util/session.c:1708 #11 0x564d34032480 in perf_session__process_event util/session.c:1877 #12 0x564d340336ad in reader__read_event util/session.c:2399 #13 0x564d34033fdc in reader__process_events util/session.c:2448 #14 0x564d34033fdc in __perf_session__process_events util/session.c:2495 #15 0x564d34033fdc in perf_session__process_events util/session.c:2661 #16 0x564d33d27113 in __cmd_report tools/perf/builtin-report.c:1065 #17 0x564d33d27113 in cmd_report tools/perf/builtin-report.c:1805 #18 0x564d33e0ccb7 in run_builtin tools/perf/perf.c:350 #19 0x564d33e0d45e in handle_internal_command tools/perf/perf.c:403 #20 0x564d33cdd827 in run_argv tools/perf/perf.c:447 #21 0x564d33cdd827 in main tools/perf/perf.c:561 ... ``` Clearing up the map_symbols properly creates maps reference count issues so resolve those. Resolving this issue doesn't improve peak heap consumption for the test above. 
Committer testing: $ sudo dnf install libasan $ make -k CORESIGHT=1 EXTRA_CFLAGS="-fsanitize=address" CC=clang O=/tmp/build/$(basename $PWD)/ -C tools/perf install-bin Reviewed-by: Kan Liang <[email protected]> Signed-off-by: Ian Rogers <[email protected]> Tested-by: Arnaldo Carvalho de Melo <[email protected]> Cc: Adrian Hunter <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sun Haiyong <[email protected]> Cc: Yanteng Si <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
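The leak fixed by the perf commit above follows the usual get/put pattern: every reference taken when a branch entry is prepared has to be dropped when the entry is finished. The sketch below is not perf code; `toy_map`, `toy_branch_entry` and the helper names are hypothetical and merely mirror the shape of `map__get()` and its put counterpart.

```c
#include <assert.h>
#include <stdlib.h>

static int toy_live_maps;  /* number of maps currently allocated */

/* Hypothetical reference-counted object. */
struct toy_map {
    int refcnt;
};

static struct toy_map *toy_map__new(void)
{
    struct toy_map *m = malloc(sizeof(*m));

    if (m) {
        m->refcnt = 1;
        toy_live_maps++;
    }
    return m;
}

static struct toy_map *toy_map__get(struct toy_map *m)
{
    if (m)
        m->refcnt++;
    return m;
}

static void toy_map__put(struct toy_map *m)
{
    if (!m)
        return;
    assert(m->refcnt > 0);
    if (--m->refcnt == 0) {
        free(m);
        toy_live_maps--;
    }
}

/* A branch entry holds one reference for each end of the branch. */
struct toy_branch_entry {
    struct toy_map *from;
    struct toy_map *to;
};

static void toy_prepare_branch_entry(struct toy_branch_entry *e,
                                     struct toy_map *m)
{
    e->from = toy_map__get(m);
    e->to = toy_map__get(m);
}

/* Without these two puts, both references above would leak. */
static void toy_finish_branch_entry(struct toy_branch_entry *e)
{
    toy_map__put(e->from);
    toy_map__put(e->to);
    e->from = NULL;
    e->to = NULL;
}
```

After a prepare/finish pair the map is back to the single reference held by its creator, so the final put frees it.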
…s_lock For storing a value to a queue attribute, the queue_attr_store function first freezes the queue (->q_usage_counter(io)) and then acquire ->sysfs_lock. This seems not correct as the usual ordering should be to acquire ->sysfs_lock before freezing the queue. This incorrect ordering causes the following lockdep splat which we are able to reproduce always simply by accessing /sys/kernel/debug file using ls command: [ 57.597146] WARNING: possible circular locking dependency detected [ 57.597154] 6.12.0-10553-gb86545e02e8c #20 Tainted: G W [ 57.597162] ------------------------------------------------------ [ 57.597168] ls/4605 is trying to acquire lock: [ 57.597176] c00000003eb56710 (&mm->mmap_lock){++++}-{4:4}, at: __might_fault+0x58/0xc0 [ 57.597200] but task is already holding lock: [ 57.597207] c0000018e27c6810 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: iterate_dir+0x94/0x1d4 [ 57.597226] which lock already depends on the new lock. [ 57.597233] the existing dependency chain (in reverse order) is: [ 57.597241] -> #5 (&sb->s_type->i_mutex_key#3){++++}-{4:4}: [ 57.597255] down_write+0x6c/0x18c [ 57.597264] start_creating+0xb4/0x24c [ 57.597274] debugfs_create_dir+0x2c/0x1e8 [ 57.597283] blk_register_queue+0xec/0x294 [ 57.597292] add_disk_fwnode+0x2e4/0x548 [ 57.597302] brd_alloc+0x2c8/0x338 [ 57.597309] brd_init+0x100/0x178 [ 57.597317] do_one_initcall+0x88/0x3e4 [ 57.597326] kernel_init_freeable+0x3cc/0x6e0 [ 57.597334] kernel_init+0x34/0x1cc [ 57.597342] ret_from_kernel_user_thread+0x14/0x1c [ 57.597350] -> #4 (&q->debugfs_mutex){+.+.}-{4:4}: [ 57.597362] __mutex_lock+0xfc/0x12a0 [ 57.597370] blk_register_queue+0xd4/0x294 [ 57.597379] add_disk_fwnode+0x2e4/0x548 [ 57.597388] brd_alloc+0x2c8/0x338 [ 57.597395] brd_init+0x100/0x178 [ 57.597402] do_one_initcall+0x88/0x3e4 [ 57.597410] kernel_init_freeable+0x3cc/0x6e0 [ 57.597418] kernel_init+0x34/0x1cc [ 57.597426] ret_from_kernel_user_thread+0x14/0x1c [ 57.597434] -> #3 (&q->sysfs_lock){+.+.}-{4:4}: [ 
57.597446] __mutex_lock+0xfc/0x12a0 [ 57.597454] queue_attr_store+0x9c/0x110 [ 57.597462] sysfs_kf_write+0x70/0xb0 [ 57.597471] kernfs_fop_write_iter+0x1b0/0x2ac [ 57.597480] vfs_write+0x3dc/0x6e8 [ 57.597488] ksys_write+0x84/0x140 [ 57.597495] system_call_exception+0x130/0x360 [ 57.597504] system_call_common+0x160/0x2c4 [ 57.597516] -> #2 (&q->q_usage_counter(io)#21){++++}-{0:0}: [ 57.597530] __submit_bio+0x5ec/0x828 [ 57.597538] submit_bio_noacct_nocheck+0x1e4/0x4f0 [ 57.597547] iomap_readahead+0x2a0/0x448 [ 57.597556] xfs_vm_readahead+0x28/0x3c [ 57.597564] read_pages+0x88/0x41c [ 57.597571] page_cache_ra_unbounded+0x1ac/0x2d8 [ 57.597580] filemap_get_pages+0x188/0x984 [ 57.597588] filemap_read+0x13c/0x4bc [ 57.597596] xfs_file_buffered_read+0x88/0x17c [ 57.597605] xfs_file_read_iter+0xac/0x158 [ 57.597614] vfs_read+0x2d4/0x3b4 [ 57.597622] ksys_read+0x84/0x144 [ 57.597629] system_call_exception+0x130/0x360 [ 57.597637] system_call_common+0x160/0x2c4 [ 57.597647] -> #1 (mapping.invalidate_lock#2){++++}-{4:4}: [ 57.597661] down_read+0x6c/0x220 [ 57.597669] filemap_fault+0x870/0x100c [ 57.597677] xfs_filemap_fault+0xc4/0x18c [ 57.597684] __do_fault+0x64/0x164 [ 57.597693] __handle_mm_fault+0x1274/0x1dac [ 57.597702] handle_mm_fault+0x248/0x484 [ 57.597711] ___do_page_fault+0x428/0xc0c [ 57.597719] hash__do_page_fault+0x30/0x68 [ 57.597727] do_hash_fault+0x90/0x35c [ 57.597736] data_access_common_virt+0x210/0x220 [ 57.597745] _copy_from_user+0xf8/0x19c [ 57.597754] sel_write_load+0x178/0xd54 [ 57.597762] vfs_write+0x108/0x6e8 [ 57.597769] ksys_write+0x84/0x140 [ 57.597777] system_call_exception+0x130/0x360 [ 57.597785] system_call_common+0x160/0x2c4 [ 57.597794] -> #0 (&mm->mmap_lock){++++}-{4:4}: [ 57.597806] __lock_acquire+0x17cc/0x2330 [ 57.597814] lock_acquire+0x138/0x400 [ 57.597822] __might_fault+0x7c/0xc0 [ 57.597830] filldir64+0xe8/0x390 [ 57.597839] dcache_readdir+0x80/0x2d4 [ 57.597846] iterate_dir+0xd8/0x1d4 [ 57.597855] sys_getdents64+0x88/0x2d4 [ 
57.597864] system_call_exception+0x130/0x360 [ 57.597872] system_call_common+0x160/0x2c4 [ 57.597881] other info that might help us debug this: [ 57.597888] Chain exists of: &mm->mmap_lock --> &q->debugfs_mutex --> &sb->s_type->i_mutex_key#3 [ 57.597905] Possible unsafe locking scenario: [ 57.597911] CPU0 CPU1 [ 57.597917] ---- ---- [ 57.597922] rlock(&sb->s_type->i_mutex_key#3); [ 57.597932] lock(&q->debugfs_mutex); [ 57.597940] lock(&sb->s_type->i_mutex_key#3); [ 57.597950] rlock(&mm->mmap_lock); [ 57.597958] *** DEADLOCK *** [ 57.597965] 2 locks held by ls/4605: [ 57.597971] #0: c0000000137c12f8 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0xcc/0x154 [ 57.597989] #1: c0000018e27c6810 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: iterate_dir+0x94/0x1d4 Prevent the above lockdep warning by acquiring ->sysfs_lock before freezing the queue while storing a queue attribute in the queue_attr_store function. Later, we also found[1] another function, __blk_mq_update_nr_hw_queues, where we first freeze the queue and then acquire the ->sysfs_lock. So we've also updated the lock ordering in the __blk_mq_update_nr_hw_queues function and ensured that in all code paths we follow the correct lock ordering, i.e. acquire ->sysfs_lock before freezing the queue. [1] https://lore.kernel.org/all/CAFj5m9Ke8+EHKQBs_Nk6hqd=LGXtk4mUxZUN5==ZcCjnZSBwHw@mail.gmail.com/ Reported-by: [email protected] Fixes: af28141 ("block: freeze the queue in queue_attr_store") Tested-by: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Signed-off-by: Nilay Shroff <[email protected]> Reviewed-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
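The ordering rule from the block-layer commit above (take ->sysfs_lock first, then freeze the queue) can be expressed as a ranking check. The sketch below is a toy order checker, not lockdep and not the block layer: all names are hypothetical, and the queue freeze is treated as just another lock class, much as lockdep reports it as `q_usage_counter(io)`.

```c
#include <assert.h>

/* Lock classes listed in the order they are supposed to be taken. */
enum toy_lock {
    TOY_SYSFS_LOCK = 0,
    TOY_QUEUE_FREEZE = 1,
    TOY_NR_LOCKS
};

struct toy_ctx {
    unsigned int held;  /* bitmask of currently held lock classes */
    int violations;     /* number of out-of-order acquisitions seen */
};

static void toy_acquire(struct toy_ctx *c, enum toy_lock l)
{
    assert(l < TOY_NR_LOCKS);
    /* Taking a lock while a later-ranked one is held inverts the order. */
    if (c->held >> (l + 1))
        c->violations++;
    c->held |= 1u << l;
}

static void toy_release(struct toy_ctx *c, enum toy_lock l)
{
    c->held &= ~(1u << l);
}

/* Old attribute store path: freeze first, then take sysfs_lock. */
static int toy_store_old_order(void)
{
    struct toy_ctx c = { 0, 0 };

    toy_acquire(&c, TOY_QUEUE_FREEZE);
    toy_acquire(&c, TOY_SYSFS_LOCK);
    toy_release(&c, TOY_SYSFS_LOCK);
    toy_release(&c, TOY_QUEUE_FREEZE);
    return c.violations;
}

/* Fixed path: sysfs_lock first, then freeze. */
static int toy_store_new_order(void)
{
    struct toy_ctx c = { 0, 0 };

    toy_acquire(&c, TOY_SYSFS_LOCK);
    toy_acquire(&c, TOY_QUEUE_FREEZE);
    toy_release(&c, TOY_QUEUE_FREEZE);
    toy_release(&c, TOY_SYSFS_LOCK);
    return c.violations;
}
```

The old order reports one violation and the new order none. A real dependency tracker such as lockdep learns the ordering from observed acquisitions rather than from a fixed ranking.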
…ge_order() Patch series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT", v3. Let's add an "easy" way to decide -- without false positives, without page-mapcounts and without page table/rmap scanning -- whether a large folio is "certainly mapped exclusively" into a single MM, or whether it "maybe mapped shared" into multiple MMs. Use that information to implement Copy-on-Write reuse, to convert folio_likely_mapped_shared() to folio_maybe_mapped_shared(), and to introduce a kernel config option that lets us not use+maintain per-page mapcounts in large folios anymore. The bigger picture was presented at LSF/MM [1]. This series is effectively a follow-up on my early work [2], which implemented a more precise, but also more complicated, way to identify whether a large folio is "mapped shared" into multiple MMs or "mapped exclusively" into a single MM. 1 Patch Organization ==================== Patch #1 -> #6: make more room in order-1 folios, so we have two "unsigned long" available for our purposes Patch #7 -> #11: preparations Patch #12: MM owner tracking for large folios Patch #13: COW reuse for PTE-mapped anon THP Patch #14: folio_maybe_mapped_shared() Patch #15 -> #20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT 2 MM owner tracking =================== We assign each MM a unique ID ("MM ID"), to be able to squeeze more information in our folios. On 32bit we use 15-bit IDs, on 64bit we use 31-bit IDs. For each large folio, we now store two MM-ID+mapcount ("slot") combinations: * mm0_id + mm0_mapcount * mm1_id + mm1_mapcount On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit mapcount. This way, we require 2x "unsigned long" on 32bit and 64bit for both slots. Paired with the large mapcount, we can reliably identify whether one of these MMs is the current owner (-> owns all mappings) or even holds all folio references (-> owns all mappings, and all references are from mappings). 
As long as only two MMs map folio pages at a time, we can reliably and precisely identify whether a large folio is "mapped shared" or "mapped exclusively". Any additional MM that starts mapping the folio while there are no free slots becomes an "untracked MM". If one such "untracked MM" is the last one mapping a folio exclusively, we will not detect the folio as "mapped exclusively" but instead as "maybe mapped shared". (exception: only a single mapping remains) So that's where the approach gets imprecise. For now, we use a bit-spinlock to sync the large mapcount + slots, and make sure we do keep the machinery fast, to not degrade (un)map performance drastically: for example, we make sure to only use a single atomic (when grabbing the bit-spinlock), like we would already perform when updating the large mapcount. 3 CONFIG_NO_PAGE_MAPCOUNT ========================= patch #15 -> #20 spell out and document what exactly is affected when not maintaining the per-page mapcounts in large folios anymore. Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore when (un)mapping pages, we'll account a complete folio as mapped if a single page is mapped. In addition, we'll not detect partially mapped anonymous folios as such in all cases yet. Likely less relevant changes include that we might now under-estimate the USS (Unique Set Size) of a process, but never over-estimate it. The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point, to then slowly make it the only option, as we learn about real-life impacts and possible ways to mitigate them. 4 Performance ============= Detailed performance numbers were included in v1 [3], and not that much changed between v1 and v2. I did plenty of measurements on different systems in the meantime, that all revealed slightly different results. The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code layout changes on some systems. 
Especially the fork() benchmark started being more-shaky-than-before on recent kernels for some reason. In summary, with my micro-benchmarks: * Small folios are not impacted. * CoW performance seems to be mostly unchanged across all folio sizes. * CoW reuse performance of large folios now matches CoW reuse performance of small folios, because we now actually implement the CoW reuse optimization. On an Intel Xeon Silver 4210R I measured a ~65% reduction in runtime, on an arm64 system I measured ~54% reduction. * munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and up to ~70% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * munmap() performance very slightly (couple percent) degrades without CONFIG_NO_PAGE_MAPCOUNT for smaller folios. For larger folios, there seems to be no change at all. * fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and up to ~10% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be almost unchanged on some systems, I saw some degradation for smaller folios on the AmpereOne A192-32X. I did not investigate the details yet, but I suspect code layout changes or suboptimal code placement / inlining. I'm not too worried about the fork() micro-benchmarks for smaller folios given how shaky the results are lately and by how much we improved fork() performance recently. I also ran case-anon-cow-rand and case-anon-cow-seq part of vm-scalability, to assess the scalability and the impact of the bit-spinlock. My measurements on a 2-socket 10-core Intel Xeon Silver 4210R CPU revealed no significant changes. 
Similarly, running these benchmarks with 2 MiB THPs enabled on the AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev, which is nice. So far, I did not get my hands on a similarly large system with multiple sockets. I found no other fitting scalability benchmarks that seem to really hammer on concurrent mapping/unmapping of large folio pages like case-anon-cow-seq does. 5 Concerns ========== 5.1 Bit spinlock ---------------- I'm not quite happy about the bit-spinlock, but so far it does not seem to affect scalability in my measurements. If it ever becomes a problem we could either investigate improving the locking, or simply stopping the MM tracking once there are "too many mappings" and simply assume that the folio is "mapped shared" until it was freed. This would be similar (but slightly different) to the "0,1,2,stopped" counting idea Willy had at some point. Adding that logic to "stop tracking" adds more code to the hot path, so I avoided that for now. 5.2 folio_maybe_mapped_shared() ------------------------------- I documented the change from folio_likely_mapped_shared() to folio_maybe_mapped_shared() quite extensively. If we run into surprises, I have some ideas on how to resolve them. For now, I think we should be fine. 5.3 Added code to map/unmap hot path ------------------------------------ So far, it looks like the added code on the rmap hot path does not really seem to matter much in the bigger picture. I'd like to further reduce it (and possibly improve fork() performance further), but I don't easily see how right now. Well, and I am out of puff 🙂 Having that said, alternatives I considered (e.g., per-MM per-folio mapcount) would add a lot more overhead to these hot paths. 6 Future Work ============= 6.1 Large mapcount ------------------ It would be very handy if the large mapcount would count how often folio pages are actually mapped into page tables: a PMD on x86-64 would count 512 times. 
Calculating the average per-page mapcount will be easy, and remapping (PMD->PTE) folios would get even faster. That would also remove the need for the entire mapcount (except for PMD-sized folios for memory statistics reasons ...), and allow for mapping folios larger than PMDs (e.g., 4 MiB) easily. We likely would also have to take the same number of folio references to make our folio_mapcount() == folio_ref_count() work, and we'd want to be able to avoid mapcount+refcount overflows: this could already become an issue with pte-mapped PUD-sized folios (fsdax). One approach we discussed in the THP cabal meeting is (1) extending the mapcount for large folios to 64bit (at least on 64bit systems) and (2) keeping the refcount at 32bit, but (3) having exactly one reference if the mapcount != 0. It should be doable, but there are some corner cases to consider on the unmap path; it is something that I will be looking into next. 6.2 hugetlb ----------- I'd love to make use of the same tracking also for hugetlb. The real problem is PMD table sharing: getting a page mapped by MM X and unmapped by MM Y will not work. With mshare, that problem should not exist (all mapping/unmapping will be routed through the mshare MM). [1] https://lwn.net/Articles/974223/ [2] https://lore.kernel.org/linux-mm/[email protected]/T/ [3] https://lkml.kernel.org/r/[email protected] [4] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c This patch (of 20): Let's factor it out into a simple helper function. This helper will also come in handy when working with code where we know that our folio is large. Maybe in the future we'll have the order readily available for small and large folios; in that case, folio_large_order() would simply translate to folio_order(). 
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Lance Yang <[email protected]> Reviewed-by: Kirill A. Shutemov <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Andy Lutomirks^H^Hski <[email protected]> Cc: Borislav Betkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcow (Oracle) <[email protected]> Cc: Michal Koutn <[email protected]> Cc: Muchun Song <[email protected]> Cc: tejun heo <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
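The two-slot scheme from the "MM owner tracking" section of the cover letter above can be modeled in a few lines. This is a simplified sketch under stated assumptions, not the kernel implementation: `toy_folio` and the `toy_*` helpers are hypothetical, MM ID 0 is assumed to mean "free slot", and locking, ID widths and the folio reference count are left out. It shows why the answer is exact while at most two MMs map the folio and why an untracked third MM yields "maybe mapped shared".

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NO_MM 0u  /* assumed marker for a free slot */

/* Hypothetical large folio with a total mapcount and two MM slots. */
struct toy_folio {
    int large_mapcount;
    unsigned int mm_id[2];
    int mm_mapcount[2];
};

static int toy_find_slot(const struct toy_folio *f, unsigned int id)
{
    for (int i = 0; i < 2; i++)
        if (f->mm_id[i] == id)
            return i;
    return -1;
}

static void toy_folio_map(struct toy_folio *f, unsigned int mm)
{
    int slot;

    assert(mm != TOY_NO_MM);
    slot = toy_find_slot(f, mm);
    if (slot < 0) {
        /* Claim a free slot if there is one, else stay untracked. */
        slot = toy_find_slot(f, TOY_NO_MM);
        if (slot >= 0)
            f->mm_id[slot] = mm;
    }
    if (slot >= 0)
        f->mm_mapcount[slot]++;
    f->large_mapcount++;
}

static void toy_folio_unmap(struct toy_folio *f, unsigned int mm)
{
    int slot = toy_find_slot(f, mm);

    assert(f->large_mapcount > 0);
    f->large_mapcount--;
    if (slot >= 0 && --f->mm_mapcount[slot] == 0)
        f->mm_id[slot] = TOY_NO_MM;
}

/* False only if one tracked MM certainly owns every mapping. */
static bool toy_folio_maybe_mapped_shared(const struct toy_folio *f)
{
    if (f->large_mapcount <= 1)
        return false;  /* a single mapping cannot be shared */
    for (int i = 0; i < 2; i++)
        if (f->mm_id[i] != TOY_NO_MM &&
            f->mm_mapcount[i] == f->large_mapcount)
            return false;
    return true;
}
```

Because a slot's mapcount never exceeds the number of mappings its MM really holds, a slot count equal to the large mapcount proves exclusivity in this model. The reverse does not hold, which is the imprecision the cover letter describes for untracked MMs.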
I found a NULL pointer dereference as follows: BUG: kernel NULL pointer dereference, address: 0000000000000028 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP PTI CPU: 5 UID: 0 PID: 5964 Comm: sh Kdump: loaded Not tainted 6.13.0-dirty #20 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1. RIP: 0010:has_unmovable_pages+0x184/0x360 ... Call Trace: <TASK> set_migratetype_isolate+0xd1/0x180 start_isolate_page_range+0xd2/0x170 alloc_contig_range_noprof+0x101/0x660 alloc_contig_pages_noprof+0x238/0x290 alloc_gigantic_folio.isra.0+0xb6/0x1f0 only_alloc_fresh_hugetlb_folio.isra.0+0xf/0x60 alloc_pool_huge_folio+0x80/0xf0 set_max_huge_pages+0x211/0x490 __nr_hugepages_store_common+0x5f/0xe0 nr_hugepages_store+0x77/0x80 kernfs_fop_write_iter+0x118/0x200 vfs_write+0x23c/0x3f0 ksys_write+0x62/0xe0 do_syscall_64+0x5b/0x170 entry_SYSCALL_64_after_hwframe+0x76/0x7e As has_unmovable_pages() calls folio_hstate() without hugetlb_lock, there is a race to free the HugeTLB page between PageHuge() and folio_hstate(). There is no need to add hugetlb_lock here as the HugeTLB page can be freed in a lot of places. So it's enough to unfold folio_hstate() and add a check to avoid NULL pointer dereference for hugepage_migration_supported(). Link: https://lkml.kernel.org/r/[email protected] Fixes: 464c7ff ("mm/hugetlb: filter out hugetlb pages if HUGEPAGE migration is not supported.") Signed-off-by: Liu Shixin <[email protected]> Acked-by: David Hildenbrand <[email protected]> Acked-by: Zi Yan <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Kirill A. Shuemov <[email protected]> Cc: Muchun Song <[email protected]> Cc: Nanyong Sun <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
The following crash is observed while handling an IOMMU fault with a recent kernel: kernel tried to execute NX-protected page - exploit attempt? (uid: 0) BUG: unable to handle page fault for address: ffff8c708299f700 PGD 19ee01067 P4D 19ee01067 PUD 101c10063 PMD 80000001028001e3 Oops: Oops: 0011 [#1] SMP NOPTI CPU: 4 UID: 0 PID: 139 Comm: irq/25-AMD-Vi Not tainted 6.15.0-rc1+ #20 PREEMPT(lazy) Hardware name: LENOVO 21D0/LNVNB161216, BIOS J6CN50WW 09/27/2024 RIP: 0010:0xffff8c708299f700 Call Trace: <TASK> ? report_iommu_fault+0x78/0xd3 ? amd_iommu_report_page_fault+0x91/0x150 ? amd_iommu_int_thread+0x77/0x180 ? __pfx_irq_thread_fn+0x10/0x10 ? irq_thread_fn+0x23/0x60 ? irq_thread+0xf9/0x1e0 ? __pfx_irq_thread_dtor+0x10/0x10 ? __pfx_irq_thread+0x10/0x10 ? kthread+0xfc/0x240 ? __pfx_kthread+0x10/0x10 ? ret_from_fork+0x34/0x50 ? __pfx_kthread+0x10/0x10 ? ret_from_fork_asm+0x1a/0x30 </TASK> report_iommu_fault() checks for an installed handler comparing the corresponding field to NULL. It can (and could before) be called for a domain with a different cookie type - IOMMU_COOKIE_DMA_IOVA, specifically. Cookie is represented as a union so we may end up with a garbage value treated there if this happens for a domain with another cookie type. Formerly there were two exclusive cookie types in the union. IOMMU_DOMAIN_SVA has a dedicated iommu_report_device_fault(). Call the fault handler only if the passed domain has a required cookie type. Found by Linux Verification Center (linuxtesting.org). Fixes: 6aa63a4 ("iommu: Sort out domain user data") Signed-off-by: Fedor Pchelkin <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
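The IOMMU crash above comes from reading one member of a union while a different member is live. A minimal sketch of the guarded pattern follows; it is not the IOMMU core code, and `toy_domain`, the `TOY_COOKIE_*` values and `toy_report_fault()` are hypothetical names chosen to mirror the description. The handler member is only trusted when the type tag says the union currently holds a handler.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum toy_cookie_type {
    TOY_COOKIE_NONE,
    TOY_COOKIE_DMA_IOVA,
    TOY_COOKIE_FAULT_HANDLER
};

typedef int (*toy_fault_handler_t)(unsigned long iova);

/* Hypothetical domain: the tag says which union member is valid. */
struct toy_domain {
    enum toy_cookie_type cookie_type;
    union {
        void *iova_cookie;            /* valid for TOY_COOKIE_DMA_IOVA */
        toy_fault_handler_t handler;  /* valid for TOY_COOKIE_FAULT_HANDLER */
    };
};

static int toy_faults_seen;

/* Example handler used for illustration only. */
static int toy_count_faults(unsigned long iova)
{
    (void)iova;
    toy_faults_seen++;
    return 0;
}

/*
 * Call the handler only if the tag matches. A bare "handler != NULL"
 * test would treat a DMA IOVA cookie pointer as code and jump to it.
 */
static int toy_report_fault(struct toy_domain *d, unsigned long iova)
{
    int ret = -ENOSYS;

    assert(d != NULL);
    if (d->cookie_type == TOY_COOKIE_FAULT_HANDLER && d->handler)
        ret = d->handler(iova);
    return ret;
}
```

A domain carrying a DMA IOVA cookie then reports `-ENOSYS` without touching the union as a function pointer, while a domain with an installed handler is dispatched as before.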
Pull request for series with
subject: bpf: permit map_ptr arithmetic with opcode add and offset 0
version: 3
url: https://patchwork.ozlabs.org/project/netdev/list/?series=200277