tools: bpftool: support creating outer maps #15

Closed
wants to merge 3 commits into from

Conversation

kernel-patches-bot

Pull request for series with
subject: tools: bpftool: support creating outer maps
version: 2
url: https://patchwork.ozlabs.org/project/netdev/list/?series=200008

kernel-patches-bot and others added 3 commits September 9, 2020 13:11
follow, as a consequence to earlier refactorings. There is a variable
("num_elems") which does not appear to be necessary, and the error
handling would look cleaner if moved to its own function. Let's clean it
up. No functional change.

v2:
- v1 was erroneously removing the check on fd maps in an attempt to get
  support for outer map dumps. This is already working. Instead, v2
  focuses on cleaning up the dump_map_elem() function, to avoid
  similar confusion in the future.

Signed-off-by: Quentin Monnet <[email protected]>
---
 tools/bpf/bpftool/map.c | 101 +++++++++++++++++++++-------------------
 1 file changed, 52 insertions(+), 49 deletions(-)
hash-of-map in bpftool. This is because the kernel needs an inner_map_fd
to collect metadata on the inner maps to be supported by the new map,
but bpftool does not provide a way to pass this file descriptor.

Add a new optional "inner_map" keyword that can be used to pass a
reference to a map, retrieve a fd to that map, and pass it as the
inner_map_fd.

Add related documentation and bash completion. Note that the inner map
can be referenced by its name, so the keyword "name" may appear several
times with different meanings (the mandatory outer map name, and
possibly a name used to look up the inner_map_fd). The bash completion
offers it just once, and will not suggest "name" on the following
command:

    # bpftool map create /sys/fs/bpf/my_outer_map type hash_of_maps \
        inner_map name my_inner_map [TAB]

Fixing that specific case seems too convoluted. Completion will work as
expected, however, if the outer map name comes first and the "inner_map
name ..." is passed second.

Signed-off-by: Quentin Monnet <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
---
 .../bpf/bpftool/Documentation/bpftool-map.rst | 10 +++-
 tools/bpf/bpftool/bash-completion/bpftool     | 22 ++++++++-
 tools/bpf/bpftool/map.c                       | 48 +++++++++++++------
 3 files changed, 62 insertions(+), 18 deletions(-)
@kernel-patches-bot

At least one diff in series https://patchwork.ozlabs.org/project/netdev/list/?series=200008 expired. Closing PR.

kernel-patches-bot pushed a commit that referenced this pull request Sep 16, 2020
…s metrics" test

Linux 5.9 introduced perf test case "Parse and process metrics" and
on s390 this test case always dumps core:

  [root@t35lp67 perf]# ./perf test -vvvv -F 67
  67: Parse and process metrics                             :
  --- start ---
  metric expr inst_retired.any / cpu_clk_unhalted.thread for IPC
  parsing metric: inst_retired.any / cpu_clk_unhalted.thread
  Segmentation fault (core dumped)
  [root@t35lp67 perf]#

I debugged this core dump and gdb shows this call chain:

  (gdb) where
   #0  0x000003ffabc3192a in __strnlen_c_1 () from /lib64/libc.so.6
   #1  0x000003ffabc293de in strcasestr () from /lib64/libc.so.6
   #2  0x0000000001102ba2 in match_metric(list=0x1e6ea20 "inst_retired.any",
            n=<optimized out>)
       at util/metricgroup.c:368
   #3  find_metric (map=<optimized out>, map=<optimized out>,
           metric=0x1e6ea20 "inst_retired.any")
      at util/metricgroup.c:765
   #4  __resolve_metric (ids=0x0, map=<optimized out>, metric_list=0x0,
           metric_no_group=<optimized out>, m=<optimized out>)
      at util/metricgroup.c:844
   #5  resolve_metric (ids=0x0, map=0x0, metric_list=0x0,
          metric_no_group=<optimized out>)
      at util/metricgroup.c:881
   #6  metricgroup__add_metric (metric=<optimized out>,
        metric_no_group=metric_no_group@entry=false, events=<optimized out>,
        events@entry=0x3ffd84fb878, metric_list=0x0,
        metric_list@entry=0x3ffd84fb868, map=0x0)
      at util/metricgroup.c:943
   #7  0x00000000011034ae in metricgroup__add_metric_list (map=0x13f9828 <map>,
        metric_list=0x3ffd84fb868, events=0x3ffd84fb878,
        metric_no_group=<optimized out>, list=<optimized out>)
      at util/metricgroup.c:988
   #8  parse_groups (perf_evlist=perf_evlist@entry=0x1e70260,
          str=str@entry=0x12f34b2 "IPC", metric_no_group=<optimized out>,
          metric_no_merge=<optimized out>,
          fake_pmu=fake_pmu@entry=0x1462f18 <perf_pmu.fake>,
          metric_events=0x3ffd84fba58, map=0x1)
      at util/metricgroup.c:1040
   #9  0x0000000001103eb2 in metricgroup__parse_groups_test(
  	evlist=evlist@entry=0x1e70260, map=map@entry=0x13f9828 <map>,
  	str=str@entry=0x12f34b2 "IPC",
  	metric_no_group=metric_no_group@entry=false,
  	metric_no_merge=metric_no_merge@entry=false,
  	metric_events=0x3ffd84fba58)
      at util/metricgroup.c:1082
   #10 0x00000000010c84d8 in __compute_metric (ratio2=0x0, name2=0x0,
          ratio1=<synthetic pointer>, name1=0x12f34b2 "IPC",
  	vals=0x3ffd84fbad8, name=0x12f34b2 "IPC")
      at tests/parse-metric.c:159
   #11 compute_metric (ratio=<synthetic pointer>, vals=0x3ffd84fbad8,
  	name=0x12f34b2 "IPC")
      at tests/parse-metric.c:189
   #12 test_ipc () at tests/parse-metric.c:208
.....
..... omitted many more lines

This test case was added with
commit 218ca91 ("perf tests: Add parse metric test for frontend metric").

When I compile with make DEBUG=y it works fine and I do not get a core dump.

It turned out that the function call chain listed above operates on a
struct pmu_event array which requires a trailing all-zero element, and that
element was missing. The macro map_for_each_event() loops over the array and
tests for members metric_expr/metric_name/metric_group being non-NULL, so
without the terminator it reads past the end of the array. Adding this
element fixes the issue.
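
As an illustration only, the sentinel convention described above can be
sketched as follows. The struct, macro and table names are hypothetical,
simplified stand-ins and not the actual perf definitions; the point is that
the loop condition is the only thing that stops the walk, so the table has
to end with an all-NULL entry.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for perf's struct pmu_event. */
struct demo_event {
	const char *name;
	const char *metric_name;
	const char *metric_expr;
};

/* Walk the table until the all-NULL sentinel entry is reached. */
#define demo_for_each_event(pe, table)					\
	for ((pe) = (table);						\
	     (pe)->name || (pe)->metric_name || (pe)->metric_expr;	\
	     (pe)++)

static const struct demo_event demo_table[] = {
	{ .metric_name = "IPC",
	  .metric_expr = "inst_retired.any / cpu_clk_unhalted.thread" },
	/* Sentinel: the compiler zeroes the remaining members. Without
	 * this entry the loop above would read past the end of the array. */
	{ .name = NULL },
};

/* Return the expression of a metric, or NULL if it is not in the table. */
static const char *demo_find_expr(const struct demo_event *table,
				  const char *metric)
{
	const struct demo_event *pe;

	demo_for_each_event(pe, table) {
		if (pe->metric_name && !strcmp(pe->metric_name, metric))
			return pe->metric_expr;
	}
	return NULL;
}
```

Looking up a metric that is not in the table is the case that walks every
entry, which is exactly when a missing sentinel turns into an out-of-bounds
read.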

Output after:

  [root@t35lp46 perf]# ./perf test 67
  67: Parse and process metrics                             : Ok
  [root@t35lp46 perf]#

Committer notes:

As Ian remarks, this is not s390 specific:

<quote Ian>
  This also shows up with address sanitizer on all architectures
  (perhaps change the patch title) and perhaps add a "Fixes: <commit>"
  tag.

  =================================================================
  ==4718==ERROR: AddressSanitizer: global-buffer-overflow on address
  0x55c93b4d59e8 at pc 0x55c93a1541e2 bp 0x7ffd24327c60 sp
  0x7ffd24327c58
  READ of size 8 at 0x55c93b4d59e8 thread T0
      #0 0x55c93a1541e1 in find_metric tools/perf/util/metricgroup.c:764:2
      #1 0x55c93a153e6c in __resolve_metric tools/perf/util/metricgroup.c:844:9
      #2 0x55c93a152f18 in resolve_metric tools/perf/util/metricgroup.c:881:9
      #3 0x55c93a1528db in metricgroup__add_metric
  tools/perf/util/metricgroup.c:943:9
      #4 0x55c93a151996 in metricgroup__add_metric_list
  tools/perf/util/metricgroup.c:988:9
      #5 0x55c93a1511b9 in parse_groups tools/perf/util/metricgroup.c:1040:8
      #6 0x55c93a1513e1 in metricgroup__parse_groups_test
  tools/perf/util/metricgroup.c:1082:9
      #7 0x55c93a0108ae in __compute_metric tools/perf/tests/parse-metric.c:159:8
      #8 0x55c93a010744 in compute_metric tools/perf/tests/parse-metric.c:189:9
      #9 0x55c93a00f5ee in test_ipc tools/perf/tests/parse-metric.c:208:2
      #10 0x55c93a00f1e8 in test__parse_metric
  tools/perf/tests/parse-metric.c:345:2
      #11 0x55c939fd7202 in run_test tools/perf/tests/builtin-test.c:410:9
      #12 0x55c939fd6736 in test_and_print tools/perf/tests/builtin-test.c:440:9
      #13 0x55c939fd58c3 in __cmd_test tools/perf/tests/builtin-test.c:661:4
      #14 0x55c939fd4e02 in cmd_test tools/perf/tests/builtin-test.c:807:9
      #15 0x55c939e4763d in run_builtin tools/perf/perf.c:313:11
      #16 0x55c939e46475 in handle_internal_command tools/perf/perf.c:365:8
      #17 0x55c939e4737e in run_argv tools/perf/perf.c:409:2
      #18 0x55c939e45f7e in main tools/perf/perf.c:539:3

  0x55c93b4d59e8 is located 0 bytes to the right of global variable
  'pme_test' defined in 'tools/perf/tests/parse-metric.c:17:25'
  (0x55c93b4d54a0) of size 1352
  SUMMARY: AddressSanitizer: global-buffer-overflow
  tools/perf/util/metricgroup.c:764:2 in find_metric
  Shadow bytes around the buggy address:
    0x0ab9a7692ae0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0x0ab9a7692af0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0x0ab9a7692b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0x0ab9a7692b10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0x0ab9a7692b20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  =>0x0ab9a7692b30: 00 00 00 00 00 00 00 00 00 00 00 00 00[f9]f9 f9
    0x0ab9a7692b40: f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9
    0x0ab9a7692b50: f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9
    0x0ab9a7692b60: f9 f9 f9 f9 f9 f9 f9 f9 00 00 00 00 00 00 00 00
    0x0ab9a7692b70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0x0ab9a7692b80: f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9
  Shadow byte legend (one shadow byte represents 8 application bytes):
    Addressable:           00
    Partially addressable: 01 02 03 04 05 06 07
    Heap left redzone:	   fa
    Freed heap region:	   fd
    Stack left redzone:	   f1
    Stack mid redzone:	   f2
    Stack right redzone:     f3
    Stack after return:	   f5
    Stack use after scope:   f8
    Global redzone:          f9
    Global init order:	   f6
    Poisoned by user:        f7
    Container overflow:	   fc
    Array cookie:            ac
    Intra object redzone:    bb
    ASan internal:           fe
    Left alloca redzone:     ca
    Right alloca redzone:    cb
    Shadow gap:              cc
</quote>

I'm also adding the missing "Fixes" tag and setting just .name to NULL,
as doing it that way is more compact (the compiler will zero out
everything else) and the table iterators look for .name being NULL as
the sentinel marking the end of the table.

Fixes: 0a507af ("perf tests: Add parse metric test for ipc metric")
Signed-off-by: Thomas Richter <[email protected]>
Reviewed-by: Sumanth Korikkar <[email protected]>
Acked-by: Ian Rogers <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Sep 24, 2020
The evsel->unit field borrows a pointer from a pmu event or alias instead
of owning a string.  But the tool event (duration_time) passed it the
result of strdup(), which was never freed and therefore leaked.
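
A minimal sketch of the ownership rule, under stated assumptions: the
struct and helper below are hypothetical reductions, and the unit string
"ns" is assumed for illustration. A borrowed field should point at storage
that outlives the object, so nothing has to be freed.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for perf's struct evsel. */
struct demo_evsel {
	const char *unit;	/* borrowed: the evsel never frees it */
};

/*
 * Leaky variant, for contrast:
 *	evsel->unit = strdup("ns");
 * The evsel does not own the field, so nobody frees the copy.
 */

/* Non-leaking variant: point at storage with static lifetime. */
static void demo_set_tool_unit(struct demo_evsel *evsel)
{
	static const char ns_unit[] = "ns";	/* assumed unit string */

	evsel->unit = ns_unit;
}
```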

It was found by ASAN during metric test:

  Direct leak of 210 byte(s) in 70 object(s) allocated from:
    #0 0x7fe366fca0b5 in strdup (/lib/x86_64-linux-gnu/libasan.so.5+0x920b5)
    #1 0x559fbbcc6ea3 in add_event_tool util/parse-events.c:414
    #2 0x559fbbcc6ea3 in parse_events_add_tool util/parse-events.c:1414
    #3 0x559fbbd8474d in parse_events_parse util/parse-events.y:439
    #4 0x559fbbcc95da in parse_events__scanner util/parse-events.c:2096
    #5 0x559fbbcc95da in __parse_events util/parse-events.c:2141
    #6 0x559fbbc28555 in check_parse_id tests/pmu-events.c:406
    #7 0x559fbbc28555 in check_parse_id tests/pmu-events.c:393
    #8 0x559fbbc28555 in check_parse_cpu tests/pmu-events.c:415
    #9 0x559fbbc28555 in test_parsing tests/pmu-events.c:498
    #10 0x559fbbc0109b in run_test tests/builtin-test.c:410
    #11 0x559fbbc0109b in test_and_print tests/builtin-test.c:440
    #12 0x559fbbc03e69 in __cmd_test tests/builtin-test.c:695
    #13 0x559fbbc03e69 in cmd_test tests/builtin-test.c:807
    #14 0x559fbbc691f4 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312
    #15 0x559fbbb071a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364
    #16 0x559fbbb071a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408
    #17 0x559fbbb071a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538
    #18 0x7fe366b68cc9 in __libc_start_main ../csu/libc-start.c:308

Fixes: f0fbb11 ("perf stat: Implement duration_time as a proper event")
Signed-off-by: Namhyung Kim <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Sep 24, 2020
test_generic_metric() failed to release the entries in the pctx.  ASAN
reported the following leak (and more):

  Direct leak of 128 byte(s) in 1 object(s) allocated from:
    #0 0x7f4c9396980e in calloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10780e)
    #1 0x55f7e748cc14 in hashmap_grow (/home/namhyung/project/linux/tools/perf/perf+0x90cc14)
    #2 0x55f7e748d497 in hashmap__insert (/home/namhyung/project/linux/tools/perf/perf+0x90d497)
    #3 0x55f7e7341667 in hashmap__set /home/namhyung/project/linux/tools/perf/util/hashmap.h:111
    #4 0x55f7e7341667 in expr__add_ref util/expr.c:120
    #5 0x55f7e7292436 in prepare_metric util/stat-shadow.c:783
    #6 0x55f7e729556d in test_generic_metric util/stat-shadow.c:858
    #7 0x55f7e712390b in compute_single tests/parse-metric.c:128
    #8 0x55f7e712390b in __compute_metric tests/parse-metric.c:180
    #9 0x55f7e712446d in compute_metric tests/parse-metric.c:196
    #10 0x55f7e712446d in test_dcache_l2 tests/parse-metric.c:295
    #11 0x55f7e712446d in test__parse_metric tests/parse-metric.c:355
    #12 0x55f7e70be09b in run_test tests/builtin-test.c:410
    #13 0x55f7e70be09b in test_and_print tests/builtin-test.c:440
    #14 0x55f7e70c101a in __cmd_test tests/builtin-test.c:661
    #15 0x55f7e70c101a in cmd_test tests/builtin-test.c:807
    #16 0x55f7e7126214 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312
    #17 0x55f7e6fc41a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364
    #18 0x55f7e6fc41a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408
    #19 0x55f7e6fc41a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538
    #20 0x7f4c93492cc9 in __libc_start_main ../csu/libc-start.c:308

Fixes: 6d432c4 ("perf tools: Add test_generic_metric function")
Signed-off-by: Namhyung Kim <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Sep 24, 2020
metricgroup__add_metric() can find multiple matches for a metric group,
and any of them can fail.  It can also fail midway, for example in
resolve_metric(), even for a single metric.

In those cases, the intermediate list and ids are leaked, like:

  Direct leak of 3 byte(s) in 1 object(s) allocated from:
    #0 0x7f4c938f40b5 in strdup (/lib/x86_64-linux-gnu/libasan.so.5+0x920b5)
    #1 0x55f7e71c1bef in __add_metric util/metricgroup.c:683
    #2 0x55f7e71c31d0 in add_metric util/metricgroup.c:906
    #3 0x55f7e71c3844 in metricgroup__add_metric util/metricgroup.c:940
    #4 0x55f7e71c488d in metricgroup__add_metric_list util/metricgroup.c:993
    #5 0x55f7e71c488d in parse_groups util/metricgroup.c:1045
    #6 0x55f7e71c60a4 in metricgroup__parse_groups_test util/metricgroup.c:1087
    #7 0x55f7e71235ae in __compute_metric tests/parse-metric.c:164
    #8 0x55f7e7124650 in compute_metric tests/parse-metric.c:196
    #9 0x55f7e7124650 in test_recursion_fail tests/parse-metric.c:318
    #10 0x55f7e7124650 in test__parse_metric tests/parse-metric.c:356
    #11 0x55f7e70be09b in run_test tests/builtin-test.c:410
    #12 0x55f7e70be09b in test_and_print tests/builtin-test.c:440
    #13 0x55f7e70c101a in __cmd_test tests/builtin-test.c:661
    #14 0x55f7e70c101a in cmd_test tests/builtin-test.c:807
    #15 0x55f7e7126214 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312
    #16 0x55f7e6fc41a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364
    #17 0x55f7e6fc41a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408
    #18 0x55f7e6fc41a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538
    #19 0x7f4c93492cc9 in __libc_start_main ../csu/libc-start.c:308
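
One common way to avoid this kind of leak, sketched here with hypothetical
names and a plain singly linked list (perf itself uses struct list_head),
is to collect entries on a private list and hand it to the caller only when
every step has succeeded. A NULL name stands in for a failure in the middle.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical list node used only for this sketch. */
struct demo_node {
	char *name;
	struct demo_node *next;
};

static void demo_free_list(struct demo_node *head)
{
	while (head) {
		struct demo_node *next = head->next;

		free(head->name);
		free(head);
		head = next;
	}
}

/*
 * Append copies of names[0..n-1] to *out. Entries are built on a private
 * list first, so a failure frees all of them and leaves *out untouched.
 */
static int demo_add_all(struct demo_node **out, const char *const *names,
			size_t n)
{
	struct demo_node *tmp = NULL, **tail = &tmp;
	size_t i;

	for (i = 0; i < n; i++) {
		struct demo_node *node;
		size_t len;

		if (!names[i])
			goto err;	/* simulated mid-way failure */
		node = calloc(1, sizeof(*node));
		if (!node)
			goto err;
		len = strlen(names[i]) + 1;
		node->name = malloc(len);
		if (!node->name) {
			free(node);
			goto err;
		}
		memcpy(node->name, names[i], len);
		*tail = node;
		tail = &node->next;
	}

	/* Success: splice the private list onto the end of the caller's. */
	while (*out)
		out = &(*out)->next;
	*out = tmp;
	return 0;

err:
	demo_free_list(tmp);
	return -1;
}
```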

Fixes: 83de0b7 ("perf metric: Collect referenced metrics in struct metric_ref_node")
Signed-off-by: Namhyung Kim <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Sep 24, 2020
Ido Schimmel says:

====================
mlxsw: Refactor headroom management

Petr says:

On Spectrum, port buffers, also called port headroom, are where packets are
stored while they are parsed and the forwarding decision is being made. For
lossless traffic flows, in case shared buffer admission is not allowed,
headroom is also where to put the extra traffic received before the sent
PAUSE takes effect. Another aspect of the port headroom is the so-called
internal buffer, which is used for egress mirroring.

Linux supports two DCB interfaces related to the headroom: dcbnl_setbuffer
for configuration, and dcbnl_getbuffer for inspection. In order to make it
possible to implement these interfaces, it is first necessary to clean up
headroom handling, which is currently strewn in several places in the
driver.

The end goal is an architecture whereby it is possible to take a copy of
the current configuration, adjust parameters, and then hand the proposed
configuration over to the system to implement it. When everything works,
the proposed configuration is accepted and saved. First, this centralizes
the reconfiguration handling to one function, which takes care of
coordinating buffer size changes and priority map changes to avoid
introducing drops. Second, the fact that the configuration is all in one
place makes it easy to keep a backup and handle error path rollbacks, which
were previously hard to understand.
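
The copy, adjust and commit flow described above can be sketched roughly as
follows. All names, fields and the validity check are hypothetical and much
reduced; this is not the mlxsw code, only an illustration of why keeping
the accepted configuration in one place makes the error path trivial.

```c
/* Hypothetical, much reduced stand-in for struct mlxsw_sp_hdroom. */
struct demo_hdroom {
	int mtu;
	int delay_bytes;
};

struct demo_port {
	struct demo_hdroom hdroom;	/* last accepted configuration */
	int max_mtu;			/* hypothetical hardware limit */
};

/* Pretend to program the hardware; reject what it cannot hold. */
static int demo_hw_apply(const struct demo_port *port,
			 const struct demo_hdroom *cfg)
{
	if (cfg->mtu <= 0 || cfg->mtu > port->max_mtu || cfg->delay_bytes < 0)
		return -1;
	return 0;
}

/* Single entry point: apply the proposal, save it only if that worked. */
static int demo_hdroom_configure(struct demo_port *port,
				 const struct demo_hdroom *proposed)
{
	int err = demo_hw_apply(port, proposed);

	if (err)
		return err;	/* port->hdroom still holds the old state */
	port->hdroom = *proposed;
	return 0;
}

/* A client: take a copy, adjust one parameter, hand the proposal over. */
static int demo_set_mtu(struct demo_port *port, int mtu)
{
	struct demo_hdroom cfg = port->hdroom;

	cfg.mtu = mtu;
	return demo_hdroom_configure(port, &cfg);
}
```

Because clients only ever modify their own copy, a rejected proposal needs
no explicit rollback of the saved state.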

Patch #1 introduces struct mlxsw_sp_hdroom, which will keep port headroom
configuration.

Patch #2 unifies handling of delay provision between PFC and PAUSE. From
now on, delay is to be measured in bytes of extra space, and will not
include MTU. PFC handler sets the delay directly from the parameter it gets
through the DCB interface. For PAUSE, MLXSW_SP_PAUSE_DELAY is converted to
have the same meaning.

In patches #3-#5, MTU, lossiness and priorities are gradually moved over to
struct mlxsw_sp_hdroom.

In patches #6-#11, handling of buffer resizing and priority maps is moved
from spectrum.c and spectrum_dcb.c to spectrum_buffers.c. The API is
gradually adapted so that struct mlxsw_sp_hdroom becomes the main interface
through which the various clients express how the headroom should be
configured.

Patch #12 is a small cleanup that the previous transformation made
possible.

In patch #13, the port init code becomes a boring client of the headroom
code, instead of rolling its own thing.

Patches #14 and #15 move handling of internal mirroring buffer to the new
headroom code as well. Previously, this code was in the SPAN module. This
patchset converts the SPAN module to another boring client of the headroom
code.
====================

Signed-off-by: David S. Miller <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Nov 16, 2020
Ido Schimmel says:

====================
nexthop: Add support for nexthop objects offload

This patch set adds support for nexthop objects offload with a dummy
implementation over netdevsim. mlxsw support will be added later.

The general idea is very similar to route offload in that notifications
are sent whenever nexthop objects are changed. A listener can veto the
change and the error will be communicated to user space with extack.

To keep listeners as simple as possible, they not only receive
notifications for the nexthop object that is changed, but also for all
the other objects affected by this change. For example, when a single
nexthop is replaced, a replace notification is sent for the single
nexthop, but also for all the nexthop groups this nexthop is a member of.
This relieves listeners from the need to track such dependencies.
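
A rough user-space sketch of that notification scheme, with hypothetical
types and function names that do not match the kernel's nexthop notifier
API: the core notifies about the changed nexthop and then about each group
it belongs to, and stops at the first listener veto.

```c
#include <stddef.h>

#define DEMO_MAX_GROUPS 4

/* Hypothetical simplified object; the kernel's structures differ. */
struct demo_nexthop {
	int id;
	int group_ids[DEMO_MAX_GROUPS];	/* groups this nexthop belongs to */
	size_t num_groups;
};

/* A listener returns 0 to accept a replace, or a negative value to veto. */
typedef int (*demo_listener_t)(int object_id, void *priv);

/*
 * Notify about the nexthop itself, then about every group it is a member
 * of, so the listener does not need to track memberships. The first veto
 * aborts the walk and is reported to the caller.
 */
static int demo_notify_replace(const struct demo_nexthop *nh,
			       demo_listener_t cb, void *priv)
{
	size_t i;
	int err;

	err = cb(nh->id, priv);
	if (err)
		return err;
	for (i = 0; i < nh->num_groups; i++) {
		err = cb(nh->group_ids[i], priv);
		if (err)
			return err;
	}
	return 0;
}

/* Example listener: counts notifications and vetoes one object id. */
struct demo_listener_state {
	int seen;
	int veto_id;
};

static int demo_listener(int object_id, void *priv)
{
	struct demo_listener_state *st = priv;

	st->seen++;
	return object_id == st->veto_id ? -1 : 0;
}
```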

To simplify things further for listeners, the notification info does not
contain the raw nexthop data structures (e.g., 'struct nexthop'), but
less complex data structures into which the raw data structures are
parsed.

Tested with a new selftest over netdevsim and with fib_nexthops.sh:

Tests passed: 164
Tests failed:   0

Patch set overview:

Patches #1-#4 introduce the aforementioned data structures and convert
existing listeners (i.e., the VXLAN driver) to use them.

Patches #5-#6 add a new RTNH_F_TRAP flag and the ability to set it and
RTNH_F_OFFLOAD on nexthops. This flag is used by netdevsim for testing
purposes and will also be used by mlxsw. These flags are consistent with
the existing RTM_F_OFFLOAD and RTM_F_TRAP flags.

Patches #7-#14 gradually add the new nexthop notifications.

Patches #15-#18 add a dummy implementation for nexthop offload over
netdevsim and a selftest to exercise both good and bad flows.

Changes since RFC [1]:

Patch #1: s/is_encap/has_encap/
Patch #3: Add a blank line in __nh_notifier_single_info_init()
Patch #5: Reword commit message
Patch #6: s/nexthop_hw_flags_set/nexthop_set_hw_flags/
Patch #7: Reword commit message
Patch #11: Allocate extack on the stack

Follow-up patch sets:

selftests: forwarding: Add nexthop objects tests
mlxsw: Preparations for nexthop objects support - part 1/2
mlxsw: Preparations for nexthop objects support - part 2/2
mlxsw: Add support for nexthop objects
mlxsw: Add support for blackhole nexthops
mlxsw: Update adjacency index more efficiently

[1] https://lore.kernel.org/netdev/[email protected]/
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Nov 20, 2020
This fix is for a failure that occurred in the DWARF unwind perf test.

Stack unwinders may probe memory when looking for frames.

Memory sanitizer will poison and track uninitialized memory on the
stack, and on the heap if the value is copied to the heap.

This can lead to false memory sanitizer failures for the use of an
uninitialized value.

Avoid this problem by removing the poison on the copied stack.
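
A minimal sketch of the idea, assuming a hypothetical helper that copies a
chunk of stack for a later unwind. __msan_unpoison() is the MemorySanitizer
interface call from <sanitizer/msan_interface.h>; the wrapper macro and the
function name are made up for this example, and the macro compiles to a
no-op when MemorySanitizer is not enabled.

```c
#include <stddef.h>
#include <string.h>

#if defined(__has_feature)
# if __has_feature(memory_sanitizer)
#  include <sanitizer/msan_interface.h>
#  define DEMO_UNPOISON(p, n) __msan_unpoison((p), (n))
# endif
#endif
#ifndef DEMO_UNPOISON
# define DEMO_UNPOISON(p, n) ((void)(p), (void)(n))	/* no-op without MSan */
#endif

/*
 * Copy part of a stack into buf for a later unwind. The source bytes may
 * be uninitialized and MSan carries that state through memcpy(), so the
 * copy is unpoisoned before an unwinder gets to probe it.
 */
static size_t demo_sample_stack(void *buf, size_t buf_size,
				const void *sp, size_t stack_size)
{
	size_t n = stack_size < buf_size ? stack_size : buf_size;

	memcpy(buf, sp, n);
	DEMO_UNPOISON(buf, n);
	return n;
}
```

Unpoisoning only the copied buffer keeps the sanitizer's checks active
everywhere else.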

The full msan failure with track origins looks like:

==2168==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x559ceb10755b in handle_cfi elfutils/libdwfl/frame_unwind.c:648:8
    #1 0x559ceb105448 in __libdwfl_frame_unwind elfutils/libdwfl/frame_unwind.c:741:4
    #2 0x559ceb0ece90 in dwfl_thread_getframes elfutils/libdwfl/dwfl_frame.c:435:7
    #3 0x559ceb0ec6b7 in get_one_thread_frames_cb elfutils/libdwfl/dwfl_frame.c:379:10
    #4 0x559ceb0ec6b7 in get_one_thread_cb elfutils/libdwfl/dwfl_frame.c:308:17
    #5 0x559ceb0ec6b7 in dwfl_getthreads elfutils/libdwfl/dwfl_frame.c:283:17
    #6 0x559ceb0ec6b7 in getthread elfutils/libdwfl/dwfl_frame.c:354:14
    #7 0x559ceb0ec6b7 in dwfl_getthread_frames elfutils/libdwfl/dwfl_frame.c:388:10
    #8 0x559ceaff6ae6 in unwind__get_entries tools/perf/util/unwind-libdw.c:236:8
    #9 0x559ceabc9dbc in test_dwarf_unwind__thread tools/perf/tests/dwarf-unwind.c:111:8
    #10 0x559ceabca5cf in test_dwarf_unwind__compare tools/perf/tests/dwarf-unwind.c:138:26
    #11 0x7f812a6865b0 in bsearch (libc.so.6+0x4e5b0)
    #12 0x559ceabca871 in test_dwarf_unwind__krava_3 tools/perf/tests/dwarf-unwind.c:162:2
    #13 0x559ceabca926 in test_dwarf_unwind__krava_2 tools/perf/tests/dwarf-unwind.c:169:9
    #14 0x559ceabca946 in test_dwarf_unwind__krava_1 tools/perf/tests/dwarf-unwind.c:174:9
    #15 0x559ceabcae12 in test__dwarf_unwind tools/perf/tests/dwarf-unwind.c:211:8
    #16 0x559ceabbc4ab in run_test tools/perf/tests/builtin-test.c:418:9
    #17 0x559ceabbc4ab in test_and_print tools/perf/tests/builtin-test.c:448:9
    #18 0x559ceabbac70 in __cmd_test tools/perf/tests/builtin-test.c:669:4
    #19 0x559ceabbac70 in cmd_test tools/perf/tests/builtin-test.c:815:9
    #20 0x559cea960e30 in run_builtin tools/perf/perf.c:313:11
    #21 0x559cea95fbce in handle_internal_command tools/perf/perf.c:365:8
    #22 0x559cea95fbce in run_argv tools/perf/perf.c:409:2
    #23 0x559cea95fbce in main tools/perf/perf.c:539:3

  Uninitialized value was stored to memory at
    #0 0x559ceb106acf in __libdwfl_frame_reg_set elfutils/libdwfl/frame_unwind.c:77:22
    #1 0x559ceb106acf in handle_cfi elfutils/libdwfl/frame_unwind.c:627:13
    #2 0x559ceb105448 in __libdwfl_frame_unwind elfutils/libdwfl/frame_unwind.c:741:4
    #3 0x559ceb0ece90 in dwfl_thread_getframes elfutils/libdwfl/dwfl_frame.c:435:7
    #4 0x559ceb0ec6b7 in get_one_thread_frames_cb elfutils/libdwfl/dwfl_frame.c:379:10
    #5 0x559ceb0ec6b7 in get_one_thread_cb elfutils/libdwfl/dwfl_frame.c:308:17
    #6 0x559ceb0ec6b7 in dwfl_getthreads elfutils/libdwfl/dwfl_frame.c:283:17
    #7 0x559ceb0ec6b7 in getthread elfutils/libdwfl/dwfl_frame.c:354:14
    #8 0x559ceb0ec6b7 in dwfl_getthread_frames elfutils/libdwfl/dwfl_frame.c:388:10
    #9 0x559ceaff6ae6 in unwind__get_entries tools/perf/util/unwind-libdw.c:236:8
    #10 0x559ceabc9dbc in test_dwarf_unwind__thread tools/perf/tests/dwarf-unwind.c:111:8
    #11 0x559ceabca5cf in test_dwarf_unwind__compare tools/perf/tests/dwarf-unwind.c:138:26
    #12 0x7f812a6865b0 in bsearch (libc.so.6+0x4e5b0)
    #13 0x559ceabca871 in test_dwarf_unwind__krava_3 tools/perf/tests/dwarf-unwind.c:162:2
    #14 0x559ceabca926 in test_dwarf_unwind__krava_2 tools/perf/tests/dwarf-unwind.c:169:9
    #15 0x559ceabca946 in test_dwarf_unwind__krava_1 tools/perf/tests/dwarf-unwind.c:174:9
    #16 0x559ceabcae12 in test__dwarf_unwind tools/perf/tests/dwarf-unwind.c:211:8
    #17 0x559ceabbc4ab in run_test tools/perf/tests/builtin-test.c:418:9
    #18 0x559ceabbc4ab in test_and_print tools/perf/tests/builtin-test.c:448:9
    #19 0x559ceabbac70 in __cmd_test tools/perf/tests/builtin-test.c:669:4
    #20 0x559ceabbac70 in cmd_test tools/perf/tests/builtin-test.c:815:9
    #21 0x559cea960e30 in run_builtin tools/perf/perf.c:313:11
    #22 0x559cea95fbce in handle_internal_command tools/perf/perf.c:365:8
    #23 0x559cea95fbce in run_argv tools/perf/perf.c:409:2
    #24 0x559cea95fbce in main tools/perf/perf.c:539:3

  Uninitialized value was stored to memory at
    #0 0x559ceb106a54 in handle_cfi elfutils/libdwfl/frame_unwind.c:613:9
    #1 0x559ceb105448 in __libdwfl_frame_unwind elfutils/libdwfl/frame_unwind.c:741:4
    #2 0x559ceb0ece90 in dwfl_thread_getframes elfutils/libdwfl/dwfl_frame.c:435:7
    #3 0x559ceb0ec6b7 in get_one_thread_frames_cb elfutils/libdwfl/dwfl_frame.c:379:10
    #4 0x559ceb0ec6b7 in get_one_thread_cb elfutils/libdwfl/dwfl_frame.c:308:17
    #5 0x559ceb0ec6b7 in dwfl_getthreads elfutils/libdwfl/dwfl_frame.c:283:17
    #6 0x559ceb0ec6b7 in getthread elfutils/libdwfl/dwfl_frame.c:354:14
    #7 0x559ceb0ec6b7 in dwfl_getthread_frames elfutils/libdwfl/dwfl_frame.c:388:10
    #8 0x559ceaff6ae6 in unwind__get_entries tools/perf/util/unwind-libdw.c:236:8
    #9 0x559ceabc9dbc in test_dwarf_unwind__thread tools/perf/tests/dwarf-unwind.c:111:8
    #10 0x559ceabca5cf in test_dwarf_unwind__compare tools/perf/tests/dwarf-unwind.c:138:26
    #11 0x7f812a6865b0 in bsearch (libc.so.6+0x4e5b0)
    #12 0x559ceabca871 in test_dwarf_unwind__krava_3 tools/perf/tests/dwarf-unwind.c:162:2
    #13 0x559ceabca926 in test_dwarf_unwind__krava_2 tools/perf/tests/dwarf-unwind.c:169:9
    #14 0x559ceabca946 in test_dwarf_unwind__krava_1 tools/perf/tests/dwarf-unwind.c:174:9
    #15 0x559ceabcae12 in test__dwarf_unwind tools/perf/tests/dwarf-unwind.c:211:8
    #16 0x559ceabbc4ab in run_test tools/perf/tests/builtin-test.c:418:9
    #17 0x559ceabbc4ab in test_and_print tools/perf/tests/builtin-test.c:448:9
    #18 0x559ceabbac70 in __cmd_test tools/perf/tests/builtin-test.c:669:4
    #19 0x559ceabbac70 in cmd_test tools/perf/tests/builtin-test.c:815:9
    #20 0x559cea960e30 in run_builtin tools/perf/perf.c:313:11
    #21 0x559cea95fbce in handle_internal_command tools/perf/perf.c:365:8
    #22 0x559cea95fbce in run_argv tools/perf/perf.c:409:2
    #23 0x559cea95fbce in main tools/perf/perf.c:539:3

  Uninitialized value was stored to memory at
    #0 0x559ceaff8800 in memory_read tools/perf/util/unwind-libdw.c:156:10
    #1 0x559ceb10f053 in expr_eval elfutils/libdwfl/frame_unwind.c:501:13
    #2 0x559ceb1060cc in handle_cfi elfutils/libdwfl/frame_unwind.c:603:18
    #3 0x559ceb105448 in __libdwfl_frame_unwind elfutils/libdwfl/frame_unwind.c:741:4
    #4 0x559ceb0ece90 in dwfl_thread_getframes elfutils/libdwfl/dwfl_frame.c:435:7
    #5 0x559ceb0ec6b7 in get_one_thread_frames_cb elfutils/libdwfl/dwfl_frame.c:379:10
    #6 0x559ceb0ec6b7 in get_one_thread_cb elfutils/libdwfl/dwfl_frame.c:308:17
    #7 0x559ceb0ec6b7 in dwfl_getthreads elfutils/libdwfl/dwfl_frame.c:283:17
    #8 0x559ceb0ec6b7 in getthread elfutils/libdwfl/dwfl_frame.c:354:14
    #9 0x559ceb0ec6b7 in dwfl_getthread_frames elfutils/libdwfl/dwfl_frame.c:388:10
    #10 0x559ceaff6ae6 in unwind__get_entries tools/perf/util/unwind-libdw.c:236:8
    #11 0x559ceabc9dbc in test_dwarf_unwind__thread tools/perf/tests/dwarf-unwind.c:111:8
    #12 0x559ceabca5cf in test_dwarf_unwind__compare tools/perf/tests/dwarf-unwind.c:138:26
    #13 0x7f812a6865b0 in bsearch (libc.so.6+0x4e5b0)
    #14 0x559ceabca871 in test_dwarf_unwind__krava_3 tools/perf/tests/dwarf-unwind.c:162:2
    #15 0x559ceabca926 in test_dwarf_unwind__krava_2 tools/perf/tests/dwarf-unwind.c:169:9
    #16 0x559ceabca946 in test_dwarf_unwind__krava_1 tools/perf/tests/dwarf-unwind.c:174:9
    #17 0x559ceabcae12 in test__dwarf_unwind tools/perf/tests/dwarf-unwind.c:211:8
    #18 0x559ceabbc4ab in run_test tools/perf/tests/builtin-test.c:418:9
    #19 0x559ceabbc4ab in test_and_print tools/perf/tests/builtin-test.c:448:9
    #20 0x559ceabbac70 in __cmd_test tools/perf/tests/builtin-test.c:669:4
    #21 0x559ceabbac70 in cmd_test tools/perf/tests/builtin-test.c:815:9
    #22 0x559cea960e30 in run_builtin tools/perf/perf.c:313:11
    #23 0x559cea95fbce in handle_internal_command tools/perf/perf.c:365:8
    #24 0x559cea95fbce in run_argv tools/perf/perf.c:409:2
    #25 0x559cea95fbce in main tools/perf/perf.c:539:3

  Uninitialized value was stored to memory at
    #0 0x559cea9027d9 in __msan_memcpy llvm/llvm-project/compiler-rt/lib/msan/msan_interceptors.cpp:1558:3
    #1 0x559cea9d2185 in sample_ustack tools/perf/arch/x86/tests/dwarf-unwind.c:41:2
    #2 0x559cea9d202c in test__arch_unwind_sample tools/perf/arch/x86/tests/dwarf-unwind.c:72:9
    #3 0x559ceabc9cbd in test_dwarf_unwind__thread tools/perf/tests/dwarf-unwind.c:106:6
    #4 0x559ceabca5cf in test_dwarf_unwind__compare tools/perf/tests/dwarf-unwind.c:138:26
    #5 0x7f812a6865b0 in bsearch (libc.so.6+0x4e5b0)
    #6 0x559ceabca871 in test_dwarf_unwind__krava_3 tools/perf/tests/dwarf-unwind.c:162:2
    #7 0x559ceabca926 in test_dwarf_unwind__krava_2 tools/perf/tests/dwarf-unwind.c:169:9
    #8 0x559ceabca946 in test_dwarf_unwind__krava_1 tools/perf/tests/dwarf-unwind.c:174:9
    #9 0x559ceabcae12 in test__dwarf_unwind tools/perf/tests/dwarf-unwind.c:211:8
    #10 0x559ceabbc4ab in run_test tools/perf/tests/builtin-test.c:418:9
    #11 0x559ceabbc4ab in test_and_print tools/perf/tests/builtin-test.c:448:9
    #12 0x559ceabbac70 in __cmd_test tools/perf/tests/builtin-test.c:669:4
    #13 0x559ceabbac70 in cmd_test tools/perf/tests/builtin-test.c:815:9
    #14 0x559cea960e30 in run_builtin tools/perf/perf.c:313:11
    #15 0x559cea95fbce in handle_internal_command tools/perf/perf.c:365:8
    #16 0x559cea95fbce in run_argv tools/perf/perf.c:409:2
    #17 0x559cea95fbce in main tools/perf/perf.c:539:3

  Uninitialized value was created by an allocation of 'bf' in the stack frame of function 'perf_event__synthesize_mmap_events'
    #0 0x559ceafc5f60 in perf_event__synthesize_mmap_events tools/perf/util/synthetic-events.c:445

SUMMARY: MemorySanitizer: use-of-uninitialized-value elfutils/libdwfl/frame_unwind.c:648:8 in handle_cfi
Signed-off-by: Ian Rogers <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: [email protected]
Cc: Jiri Olsa <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sandeep Dasgupta <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Dec 4, 2020
This fixes a possible crash scenario where interfaces that were not
set up in the driver yet might still be iterated over.  When originally
debugged on the ath10k-ct driver, the crash looked like this:

kernel BUG at /home/greearb/git/linux-4.7.dev.y/drivers/net/wireless/ath/ath10k/wmi.c:1781!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 bridge carl9170 mac80211_hwsim ath10k_pci ath10k_core ath5k ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 8021q garp mrp stp llc bnep bluetooth fuse macvlan pktgen rpcsec_gss_krb5 nfsv4 nfs fscache snd_hda_codec_hdmi coretemp hwmon intel_rapl x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_realtek snd_hda_codec_generic kvm iTCO_wdt irqbypass iTCO_vendor_support joydev snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device pcspkr snd_pcm snd_timer shpchp snd i2c_i801 lpc_ich soundcore tpm_tis tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc i915 serio_raw i2c_algo_bit drm_kms_helper ata_generic e1000e pata_acpi drm ptp pps_core i2c_core fjes video ipv6 [last unloaded: nf_conntrack]
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.10+ #15
Hardware name: To be filled by O.E.M. To be filled by O.E.M./ChiefRiver, BIOS 4.6.5 06/07/2013
task: ffff8801d4f20000 ti: ffff8801d4f28000 task.ti: ffff8801d4f28000
RIP: 0010:[<ffffffffa0efbcfb>]  [<ffffffffa0efbcfb>] ath10k_wmi_tx_beacons_iter+0x28b/0x290 [ath10k_core]
RSP: 0018:ffff8801d6447a98  EFLAGS: 00010293
RAX: 0000000000000018 RBX: ffff8801ce97e1d8 RCX: 0000000000000000
RDX: 0000000000000018 RSI: 0000000000000003 RDI: ffffed003ac88f49
RBP: ffff8801d6447af0 R08: 0000000000000003 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
R13: ffff8801ce97e320 R14: ffff8801ce97e378 R15: ffff8801ce97ca40
FS:  0000000000000000(0000) GS:ffff8801d6440000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007eff191ef1ab CR3: 000000000260a000 CR4: 00000000001406e0
Stack:
 1ffff1003ac88f59 0000000041b58ab3 ffffffffa0f4d52a ffff8801d4f20000
 0000000000000246 0000000000000002 ffff8801ce97e1d8 ffff8801bd5d39b8
 0000000000000002 0000000000000001 ffff8801ce97ca40 ffff8801d6447b48
Call Trace:
 <IRQ>
 [<ffffffffa0d03e5c>] __iterate_interfaces+0xfc/0x1d0 [mac80211]
 [<ffffffffa0efba70>] ? ath10k_wmi_cmd_send_nowait+0x260/0x260 [ath10k_core]
 [<ffffffffa0efba70>] ? ath10k_wmi_cmd_send_nowait+0x260/0x260 [ath10k_core]
 [<ffffffffa0d04477>] ieee80211_iterate_active_interfaces_atomic+0x67/0x100 [mac80211]
 [<ffffffffa0d04410>] ? ieee80211_handle_reconfig_failure+0x140/0x140 [mac80211]
 [<ffffffffa0ef4060>] ? ath10k_tpc_config_disp_tables+0x620/0x620 [ath10k_core]
 [<ffffffffa0ef408b>] ath10k_wmi_op_ep_tx_credits+0x2b/0x50 [ath10k_core]
 [<ffffffffa0ee2fd2>] ath10k_htc_rx_completion_handler+0x422/0x5c0 [ath10k_core]
 [<ffffffffa0b4301e>] ath10k_pci_process_rx_cb+0x37e/0x430 [ath10k_pci]
 [<ffffffffa0ee2bb0>] ? ath10k_htc_build_tx_ctrl_skb+0xc0/0xc0 [ath10k_core]
 [<ffffffffa0b42ca0>] ? ath10k_pci_rx_post_pipe+0x550/0x550 [ath10k_pci]
 [<ffffffff8120cbe5>] ? debug_lockdep_rcu_enabled+0x35/0x40
 [<ffffffff811e1893>] ? mark_held_locks+0x23/0xc0
 [<ffffffff8116019a>] ? __local_bh_enable_ip+0x6a/0xd0
 [<ffffffff811e1abb>] ? trace_hardirqs_on_caller+0x18b/0x290
 [<ffffffff811e1bcd>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff8116019a>] ? __local_bh_enable_ip+0x6a/0xd0
 [<ffffffff81df11d0>] ? _raw_spin_unlock_bh+0x30/0x40
 [<ffffffffa0b4902e>] ? ath10k_ce_per_engine_service+0xee/0x100 [ath10k_pci]
 [<ffffffffa0b43139>] ath10k_pci_htt_htc_rx_cb+0x29/0x30 [ath10k_pci]
 [<ffffffffa0b48fe6>] ath10k_ce_per_engine_service+0xa6/0x100 [ath10k_pci]
 [<ffffffffa0b49116>] ath10k_ce_per_engine_service_any+0xd6/0xf0 [ath10k_pci]
 [<ffffffffa0b45800>] ? ath10k_pci_enable_legacy_irq+0xe0/0xe0 [ath10k_pci]
 [<ffffffffa0b4585f>] ath10k_pci_tasklet+0x5f/0xb0 [ath10k_pci]
 [<ffffffff81160445>] tasklet_action+0x245/0x2b0
 [<ffffffff81df4831>] __do_softirq+0x181/0x595
 [<ffffffff8116137c>] irq_exit+0xbc/0xc0
 [<ffffffff81df423c>] do_IRQ+0x7c/0x150
 [<ffffffff81df23cc>] common_interrupt+0x8c/0x8c
 <EOI>
 [<ffffffff811e1abb>] ? trace_hardirqs_on_caller+0x18b/0x290
 [<ffffffff81b722ae>] ? cpuidle_enter_state+0x1ae/0x4b0
 [<ffffffff81b722a7>] ? cpuidle_enter_state+0x1a7/0x4b0
 [<ffffffff81b72602>] cpuidle_enter+0x12/0x20
 [<ffffffff811d0b6e>] call_cpuidle+0x4e/0x90
 [<ffffffff811d10e7>] cpu_startup_entry+0x3f7/0x540
 [<ffffffff811d0cf0>] ? default_idle_call+0x50/0x50
 [<ffffffff81234bdf>] ? clockevents_config_and_register+0x5f/0x70
 [<ffffffff81085a9a>] ? setup_APIC_timer+0xfa/0x110
 [<ffffffff81083b63>] start_secondary+0x253/0x2b0
 [<ffffffff81083910>] ? set_cpu_sibling_map+0x920/0x920
Code: 4d 49 e0 8b b3 48 01 00 00 48 c7 c7 a0 ee f3 a0 e8 d9 c2 3f e0 49 81 fd 3f 1f 00 00 76 0f 49 81 fc 3f 1f 00 00 0f 87 c0 fd ff ff <0f> 0b 0f 0b 90 55 48 89 e5 41 57 41 56 48 8d 85 58 ff ff ff 41
RIP  [<ffffffffa0efbcfb>] ath10k_wmi_tx_beacons_iter+0x28b/0x290 [ath10k_core]
 RSP <ffff8801d6447a98>
---[ end trace 6588464714e5163a ]---

Similar logic was tested for years in the ath10k-ct driver with various
firmware.

Also tested with a stock kernel plus this patch, with firmware
10.2.4-1.0-00037.

The test case was to bring up 5 VAPs on a radio, fake a firmware
crash, and make sure the AP interfaces continue to function properly.

Signed-off-by: Ben Greear <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
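
The idea behind the fix, iterating only over interfaces that the driver has finished setting up, can be modelled in a few lines of userspace Python. This is a sketch under stated assumptions, not the mac80211 code: `Interface`, its `in_driver` flag, and `iterate_active_interfaces` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Interface:
    name: str
    running: bool     # interface is administratively up
    in_driver: bool   # hypothetical flag: the driver has set this interface up


def iterate_active_interfaces(ifaces: Iterable[Interface],
                              callback: Callable[[Interface], None]) -> None:
    """Call `callback` only for interfaces that are up AND known to the driver."""
    for iface in ifaces:
        if not iface.running:
            continue
        if not iface.in_driver:
            # Skipping here avoids handing the driver an interface whose
            # driver-private state has not been initialized yet.
            continue
        callback(iface)


visited: List[str] = []
vaps = [
    Interface("vap0", running=True, in_driver=True),
    Interface("vap1", running=True, in_driver=False),  # e.g. mid-reconfig
    Interface("vap2", running=False, in_driver=True),
]
iterate_active_interfaces(vaps, lambda i: visited.append(i.name))
```

Only `vap0` reaches the callback in this model; the half-configured `vap1` is skipped rather than passed to driver code that would trip over it.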
kernel-patches-bot pushed a commit that referenced this pull request Dec 15, 2020
Ido Schimmel says:

====================
mlxsw: Introduce initial XM router support

This patch set implements initial eXtended Mezzanine (XM) router
support.

The XM is an external device connected to the Spectrum-{2,3} ASICs using
dedicated Ethernet ports. Its purpose is to increase the number of
routes that can be offloaded to hardware. This is achieved by having the
ASIC act as a cache that refers cache misses to the XM where the FIB is
stored and LPM lookup is performed.

Future patch sets will add more sophisticated cache flushing and
selftests that utilize cache counters on the ASIC, which we plan to
expose via devlink-metric [1].

Patch set overview:

Patches #1-#2 add registers to insert/remove routes to/from the XM and
to enable/disable it. Patch #3 utilizes these registers in order to
implement XM-specific router low-level operations.

Patches #4-#5 query from firmware the availability of the XM and the
local ports that are used to connect the ASIC to the XM, so that netdevs
will not be created for them.

Patches #6-#8 initialize the XM by configuring its cache parameters.

Patches #9-#10 implement cache management, so that LPM lookup will be
correctly cached in the ASIC.

Patches #11-#13 implement cache flushing, so that route
insertions/removals to/from the XM will flush the affected entries in
the cache.

Patch #14 configures the ASIC to allocate half of its memory for the
cache, so that room will be left for other entries (e.g., FDBs,
neighbours).

Patch #15 starts using the XM for IPv4 route offload, when available.

[1] https://lore.kernel.org/netdev/[email protected]/
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
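
The cache-in-front-of-LPM arrangement described in the cover letter can be illustrated with a small userspace model. This is a sketch only: `XmFib` and `AsicCache` are hypothetical classes, and caching one entry per destination address is an assumption made for brevity, not a description of how the Spectrum ASIC keys its cache.

```python
import ipaddress
from typing import Dict, Optional


class XmFib:
    """Hypothetical stand-in for the XM: stores the FIB and performs LPM."""

    def __init__(self) -> None:
        self.routes: Dict[ipaddress.IPv4Network, str] = {}

    def insert(self, prefix: str, nexthop: str) -> None:
        self.routes[ipaddress.ip_network(prefix)] = nexthop

    def lpm(self, addr: ipaddress.IPv4Address) -> Optional[str]:
        # Longest-prefix match: among all covering prefixes, pick the longest.
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        return self.routes[max(matches, key=lambda net: net.prefixlen)]


class AsicCache:
    """Hypothetical stand-in for the ASIC acting as a cache in front of the XM."""

    def __init__(self, xm: XmFib) -> None:
        self.xm = xm
        self.entries: Dict[ipaddress.IPv4Address, Optional[str]] = {}
        self.misses = 0

    def lookup(self, dst: str) -> Optional[str]:
        addr = ipaddress.ip_address(dst)
        if addr in self.entries:
            return self.entries[addr]
        # Cache miss: refer the lookup to the XM and remember the answer.
        self.misses += 1
        nexthop = self.xm.lpm(addr)
        self.entries[addr] = nexthop
        return nexthop

    def insert_route(self, prefix: str, nexthop: str) -> None:
        self.xm.insert(prefix, nexthop)
        # Flush cached entries covered by the new prefix; they may be stale.
        net = ipaddress.ip_network(prefix)
        self.entries = {a: nh for a, nh in self.entries.items() if a not in net}


xm = XmFib()
asic = AsicCache(xm)
asic.insert_route("10.0.0.0/8", "nh-a")
first = asic.lookup("10.1.2.3")    # miss, resolved by the XM
second = asic.lookup("10.1.2.3")   # served from the cache
misses_before_flush = asic.misses
asic.insert_route("10.1.0.0/16", "nh-b")
third = asic.lookup("10.1.2.3")    # entry was flushed, so the XM is asked again
```

The last lookup shows why route insertions need to flush affected entries: without the flush, the cache would keep answering with the next hop of the shorter prefix.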
kernel-patches-bot pushed a commit that referenced this pull request Mar 10, 2021
Calling btrfs_qgroup_reserve_meta_prealloc from
btrfs_delayed_inode_reserve_metadata can result in flushing delalloc
while holding a transaction and delayed node locks. This is deadlock
prone. In the past, multiple commits:

 * ae5e070 ("btrfs: qgroup: don't try to wait flushing if we're
   already holding a transaction")

 * 6f23277 ("btrfs: qgroup: don't commit transaction when we already
   hold the handle")

tried to solve various aspects of this, but this was always a
whack-a-mole game. Unfortunately those 2 fixes don't solve a deadlock
scenario involving btrfs_delayed_node::mutex. Namely, one thread
can call btrfs_dirty_inode as a result of reading a file and modifying
its atime:

  PID: 6963   TASK: ffff8c7f3f94c000  CPU: 2   COMMAND: "test"
  #0  __schedule at ffffffffa529e07d
  #1  schedule at ffffffffa529e4ff
  #2  schedule_timeout at ffffffffa52a1bdd
  #3  wait_for_completion at ffffffffa529eeea             <-- sleeps with delayed node mutex held
  #4  start_delalloc_inodes at ffffffffc0380db5
  #5  btrfs_start_delalloc_snapshot at ffffffffc0393836
  #6  try_flush_qgroup at ffffffffc03f04b2
  #7  __btrfs_qgroup_reserve_meta at ffffffffc03f5bb6     <-- tries to reserve space and starts delalloc inodes.
  #8  btrfs_delayed_update_inode at ffffffffc03e31aa      <-- acquires delayed node mutex
  #9  btrfs_update_inode at ffffffffc0385ba8
 #10  btrfs_dirty_inode at ffffffffc038627b               <-- TRANSACTION OPENED
 #11  touch_atime at ffffffffa4cf0000
 #12  generic_file_read_iter at ffffffffa4c1f123
 #13  new_sync_read at ffffffffa4ccdc8a
 #14  vfs_read at ffffffffa4cd0849
 #15  ksys_read at ffffffffa4cd0bd1
 #16  do_syscall_64 at ffffffffa4a052eb
 #17  entry_SYSCALL_64_after_hwframe at ffffffffa540008c

This will cause asynchronous work to be queued to flush the delalloc
inodes, and that work can try to acquire the same delayed_node mutex:

  PID: 455    TASK: ffff8c8085fa4000  CPU: 5   COMMAND: "kworker/u16:30"
  #0  __schedule at ffffffffa529e07d
  #1  schedule at ffffffffa529e4ff
  #2  schedule_preempt_disabled at ffffffffa529e80a
  #3  __mutex_lock at ffffffffa529fdcb                    <-- goes to sleep, never wakes up.
  #4  btrfs_delayed_update_inode at ffffffffc03e3143      <-- tries to acquire the mutex
  #5  btrfs_update_inode at ffffffffc0385ba8              <-- this is the same inode that pid 6963 is holding
  #6  cow_file_range_inline.constprop.78 at ffffffffc0386be7
  #7  cow_file_range at ffffffffc03879c1
  #8  btrfs_run_delalloc_range at ffffffffc038894c
  #9  writepage_delalloc at ffffffffc03a3c8f
 #10  __extent_writepage at ffffffffc03a4c01
 #11  extent_write_cache_pages at ffffffffc03a500b
 #12  extent_writepages at ffffffffc03a6de2
 #13  do_writepages at ffffffffa4c277eb
 #14  __filemap_fdatawrite_range at ffffffffa4c1e5bb
 #15  btrfs_run_delalloc_work at ffffffffc0380987         <-- starts running delayed nodes
 #16  normal_work_helper at ffffffffc03b706c
 #17  process_one_work at ffffffffa4aba4e4
 #18  worker_thread at ffffffffa4aba6fd
 #19  kthread at ffffffffa4ac0a3d
 #20  ret_from_fork at ffffffffa54001ff

To fully address those cases, the complete fix is to never issue any
flushing while holding the transaction or the delayed node lock. This
patch achieves that by calling qgroup_reserve_meta directly, which will
either succeed without flushing or fail and return -EDQUOT. In the
latter case, the return value is propagated to btrfs_dirty_inode, which
falls back to starting a new transaction. That's fine, as the majority
of the time we expect the inode to have the
BTRFS_DELAYED_NODE_INODE_DIRTY flag set, which results in directly
copying the in-memory state.

Fixes: c53e965 ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
CC: [email protected] # 5.10+
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: David Sterba <[email protected]>
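
The control flow of the fix (reserve without flushing, report -EDQUOT, and let the caller fall back) can be modelled in userspace as follows. This is a sketch under stated assumptions: the `Qgroup` class, its byte accounting, and the string results are hypothetical and only mirror the shape of the logic described above, not the btrfs implementation.

```python
import errno


class Qgroup:
    """Hypothetical qgroup with a byte limit and a counter of flush attempts."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.reserved = 0
        self.flushes = 0

    def reserve_noflush(self, nbytes: int) -> int:
        # Either succeeds immediately or fails; never flushes, never sleeps.
        if self.reserved + nbytes > self.limit:
            return -errno.EDQUOT
        self.reserved += nbytes
        return 0

    def reserve_with_flush(self, nbytes: int) -> int:
        # The variant that is unsafe while a transaction or delayed node
        # lock is held, because flushing may need those same locks.
        ret = self.reserve_noflush(nbytes)
        if ret == -errno.EDQUOT:
            self.flushes += 1
            ret = self.reserve_noflush(nbytes)
        return ret


def delayed_inode_reserve(qg: Qgroup, nbytes: int) -> int:
    """Runs with locks held in this model, so it must use the no-flush path."""
    return qg.reserve_noflush(nbytes)


def dirty_inode(qg: Qgroup, nbytes: int) -> str:
    ret = delayed_inode_reserve(qg, nbytes)
    if ret == -errno.EDQUOT:
        # The error is propagated and the caller falls back.
        return "fallback: start a new transaction"
    return "updated in current transaction"


qg = Qgroup(limit=100)
first = dirty_inode(qg, 60)    # fits under the limit
second = dirty_inode(qg, 60)   # would exceed the limit, so -EDQUOT is returned
```

The point of the model is that `qg.flushes` stays at zero on the locked path: the failure is surfaced to the caller instead of being handled by flushing in a context where that can deadlock.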
kernel-patches-bot pushed a commit that referenced this pull request Mar 10, 2021
The evlist and the cpu/thread maps should be released together.
Otherwise, the following error is reported by ASan.

Note that this test still has memory leaks in DSOs so it still fails
even after this change.  I'll take a look at that too.

  # perf test -v 26
  26: Object code reading                        :
  --- start ---
  test child forked, pid 154184
  Looking at the vmlinux_path (8 entries long)
  symsrc__init: build id mismatch for vmlinux.
  symsrc__init: cannot get elf header.
  Using /proc/kcore for kernel data
  Using /proc/kallsyms for symbols
  Parsing event 'cycles'
  mmap size 528384B
  ...
  =================================================================
  ==154184==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 439 byte(s) in 1 object(s) allocated from:
    #0 0x7fcb66e77037 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x55ad9b7e821e in dso__new_id util/dso.c:1256
    #2 0x55ad9b8cfd4a in __machine__addnew_vdso util/vdso.c:132
    #3 0x55ad9b8cfd4a in machine__findnew_vdso util/vdso.c:347
    #4 0x55ad9b845b7e in map__new util/map.c:176
    #5 0x55ad9b8415a2 in machine__process_mmap2_event util/machine.c:1787
    #6 0x55ad9b8fab16 in perf_tool__process_synth_event util/synthetic-events.c:64
    #7 0x55ad9b8fab16 in perf_event__synthesize_mmap_events util/synthetic-events.c:499
    #8 0x55ad9b8fbfdf in __event__synthesize_thread util/synthetic-events.c:741
    #9 0x55ad9b8ff3e3 in perf_event__synthesize_thread_map util/synthetic-events.c:833
    #10 0x55ad9b738585 in do_test_code_reading tests/code-reading.c:608
    #11 0x55ad9b73b25d in test__code_reading tests/code-reading.c:722
    #12 0x55ad9b6f28fb in run_test tests/builtin-test.c:428
    #13 0x55ad9b6f28fb in test_and_print tests/builtin-test.c:458
    #14 0x55ad9b6f4a53 in __cmd_test tests/builtin-test.c:679
    #15 0x55ad9b6f4a53 in cmd_test tests/builtin-test.c:825
    #16 0x55ad9b760cc4 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313
    #17 0x55ad9b5eaa88 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365
    #18 0x55ad9b5eaa88 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409
    #19 0x55ad9b5eaa88 in main /home/namhyung/project/linux/tools/perf/perf.c:539
    #20 0x7fcb669acd09 in __libc_start_main ../csu/libc-start.c:308

    ...
  SUMMARY: AddressSanitizer: 471 byte(s) leaked in 2 allocation(s).
  test child finished with 1
  ---- end ----
  Object code reading: FAILED!

Signed-off-by: Namhyung Kim <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Leo Yan <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Mar 26, 2021
Pablo Neira Ayuso says:

====================
netfilter: flowtable enhancements

[ This is v2 that includes documentation enhancements, including
  existing limitations. This is a rebase on top of net-next. ]

The following patchset augments the Netfilter flowtable fastpath to
support network topologies that combine IP forwarding, bridge,
classic VLAN devices, bridge VLAN filtering, DSA and PPPoE. This
includes support for the flowtable software and hardware datapaths.

The following picture provides an example scenario:

                        fast path!
                .------------------------.
               /                          \
               |           IP forwarding  |
               |          /             \ \/
               |       br0               wan ..... eth0
               .       / \                         host C
               -> veth1  veth2
                   .           switch/router
                   .
                   .
                 eth0
                host A

The bridge master device 'br0' has an IP address, and a DHCP server is
also assumed to be running to provide connectivity to host A, which
reaches the Internet through 'br0' as its default gateway. Packets then
enter the IP forwarding path, and Netfilter is used to NAT the packets
before they leave through the wan device.

The general idea is to accelerate forwarding by building a fast path
that takes packets from the ingress path of the bridge port and places
them in the egress path of the wan device (and vice versa), hence
skipping the classic bridge and IP stack paths.

** Patches from #1 to #6 add the infrastructure which describes the list of
   netdevice hops to reach a given destination MAC address in the local
   network topology.

Patch #1 adds dev_fill_forward_path() and .ndo_fill_forward_path() to
         netdev_ops.

Patch #2 adds .ndo_fill_forward_path for vlan devices, which provides
         the next device hop via vlan->real_dev, the vlan ID and the
         protocol.

Patch #3 adds .ndo_fill_forward_path for bridge devices, which allows
         FDB lookups to locate the next device hop (bridge port) in the
         forwarding path.

Patch #4 extends bridge .ndo_fill_forward_path to support bridge VLAN
         filtering.

Patch #5 adds .ndo_fill_forward_path for PPPoE devices.

Patch #6 adds .ndo_fill_forward_path for DSA.

Patches from #7 to #14 update the flowtable software datapath:

Patch #7 adds the transmit path type field to the flow tuple. Two transmit
         paths are supported so far: the neighbour and the xfrm transmit
         paths.

Patch #8 and #9 update the flowtable datapath to use dev_fill_forward_path()
         to obtain the real ingress/egress device for the flowtable datapath.
         This adds the new ethernet xmit direct path to the flowtable.

Patch #10 adds native flowtable VLAN support (up to 2 VLAN tags) through
          dev_fill_forward_path(). The flowtable stores the VLAN id and
          protocol in the flow tuple.

Patch #11 adds native flowtable bridge VLAN filter support through
          dev_fill_forward_path().

Patch #12 adds native flowtable bridge PPPoE through dev_fill_forward_path().

Patch #13 adds DSA support through dev_fill_forward_path().

Patch #14 extends flowtable selftests to cover the flowtable software
          datapath enhancements.

** Patches from #15 to #20 update the flowtable hardware offload datapath:

Patch #15 extends the flowtable hardware offload to support the
          direct ethernet xmit path. This also includes VLAN support.

Patch #16 stores the egress real device in the flow tuple. The software
          flowtable datapath uses dev_hard_header() to transmit packets,
          hence it might refer to VLAN/DSA/PPPoE software device, not
          the real ethernet device.

Patch #17 deals with switchdev PVID hardware offload to skip it on
          egress.

Patch #18 adds FLOW_ACTION_PPPOE_PUSH to the flow_offload action API.

Patch #19 extends the flowtable hardware offload to support PPPoE.

Patch #20 adds TC_SETUP_FT support for DSA.

** Patches from #20 to #23: Felix Fietkau adds a new driver which supports
   hardware offload for the mtk PPE engine through the existing flow
   offload API, which supports the flowtable enhancements coming in
   this batch.

Patch #24 extends the documentation and describes existing limitations.

Please, apply, thanks.
====================

Signed-off-by: David S. Miller <[email protected]>
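
The hop-by-hop resolution that the first six patches describe, where each virtual device reports the next lower device until a real device is reached, can be modelled in userspace like this. It is a sketch, not the kernel API: `Device`, `VlanDevice`, `BridgeDevice`, `next_hop`, and this Python `fill_forward_path` are hypothetical stand-ins for the netdev_ops callbacks, and the MAC addresses are made up.

```python
from typing import Dict, List, Optional, Tuple


class Device:
    """A real (bottom-level) device: it has no lower device to report."""

    def __init__(self, name: str) -> None:
        self.name = name

    def next_hop(self, dst_mac: str, path: List[dict]) -> Optional["Device"]:
        return None


class VlanDevice(Device):
    def __init__(self, name: str, real_dev: Device, vlan_id: int) -> None:
        super().__init__(name)
        self.real_dev = real_dev
        self.vlan_id = vlan_id

    def next_hop(self, dst_mac: str, path: List[dict]) -> Optional[Device]:
        # A VLAN device records its tag and points at its real device.
        path.append({"type": "vlan", "dev": self.name, "id": self.vlan_id})
        return self.real_dev


class BridgeDevice(Device):
    def __init__(self, name: str, fdb: Dict[str, Device]) -> None:
        super().__init__(name)
        self.fdb = fdb

    def next_hop(self, dst_mac: str, path: List[dict]) -> Optional[Device]:
        # A bridge looks the destination MAC up in its FDB to find the port.
        port = self.fdb.get(dst_mac)
        if port is not None:
            path.append({"type": "bridge", "dev": self.name, "port": port.name})
        return port


def fill_forward_path(dev: Device, dst_mac: str) -> Tuple[Device, List[dict]]:
    """Walk down the device stack and return (last device reached, hops)."""
    path: List[dict] = []
    while True:
        lower = dev.next_hop(dst_mac, path)
        if lower is None:
            # Either a real device, or a hop that could not be resolved.
            return dev, path
        dev = lower


veth1 = Device("veth1")
br0 = BridgeDevice("br0", fdb={"aa:bb:cc:dd:ee:01": veth1})
br0_10 = VlanDevice("br0.10", real_dev=br0, vlan_id=10)
real_dev, hops = fill_forward_path(br0_10, "aa:bb:cc:dd:ee:01")
```

In this model the walk from `br0.10` ends at `veth1` with two recorded hops (the VLAN tag, then the bridge port). For a MAC address that is not in the FDB the walk stops at the bridge itself.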
kernel-patches-bot pushed a commit that referenced this pull request Apr 16, 2021
While removing a qgroup's sysfs entry we end up taking the kernfs_mutex,
through kobject_del(), while holding the fs_info->qgroup_lock spinlock,
producing the following trace:

  [821.843637] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:281
  [821.843641] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 28214, name: podman
  [821.843644] CPU: 3 PID: 28214 Comm: podman Tainted: G        W         5.11.6 #15
  [821.843646] Hardware name: Dell Inc. PowerEdge R330/084XW4, BIOS 2.11.0 12/08/2020
  [821.843647] Call Trace:
  [821.843650]  dump_stack+0xa1/0xfb
  [821.843656]  ___might_sleep+0x144/0x160
  [821.843659]  mutex_lock+0x17/0x40
  [821.843662]  kernfs_remove_by_name_ns+0x1f/0x80
  [821.843666]  sysfs_remove_group+0x7d/0xe0
  [821.843668]  sysfs_remove_groups+0x28/0x40
  [821.843670]  kobject_del+0x2a/0x80
  [821.843672]  btrfs_sysfs_del_one_qgroup+0x2b/0x40 [btrfs]
  [821.843685]  __del_qgroup_rb+0x12/0x150 [btrfs]
  [821.843696]  btrfs_remove_qgroup+0x288/0x2a0 [btrfs]
  [821.843707]  btrfs_ioctl+0x3129/0x36a0 [btrfs]
  [821.843717]  ? __mod_lruvec_page_state+0x5e/0xb0
  [821.843719]  ? page_add_new_anon_rmap+0xbc/0x150
  [821.843723]  ? kfree+0x1b4/0x300
  [821.843725]  ? mntput_no_expire+0x55/0x330
  [821.843728]  __x64_sys_ioctl+0x5a/0xa0
  [821.843731]  do_syscall_64+0x33/0x70
  [821.843733]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [821.843736] RIP: 0033:0x4cd3fb
  [821.843741] RSP: 002b:000000c000906b20 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
  [821.843744] RAX: ffffffffffffffda RBX: 000000c000050000 RCX: 00000000004cd3fb
  [821.843745] RDX: 000000c000906b98 RSI: 000000004010942a RDI: 000000000000000f
  [821.843747] RBP: 000000c000907cd0 R08: 000000c000622901 R09: 0000000000000000
  [821.843748] R10: 000000c000d992c0 R11: 0000000000000206 R12: 000000000000012d
  [821.843749] R13: 000000000000012c R14: 0000000000000200 R15: 0000000000000049

Fix this by removing the qgroup sysfs entry while not holding the spinlock,
since the spinlock is only meant for protection of the qgroup rbtree.

Reported-by: Stuart Shelton <[email protected]>
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
Fixes: 49e5fb4 ("btrfs: qgroup: export qgroups in sysfs")
CC: [email protected] # 5.10+
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
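
The pattern used by the fix, unlinking the entry while holding the lock and running the potentially sleeping cleanup only after the lock is dropped, can be sketched in userspace. The `QgroupTable` class and its members are hypothetical: a `threading.Lock` stands in for the spinlock, a dict for the rbtree, and a set for the sysfs entries.

```python
import threading
from typing import Dict, Set


class QgroupTable:
    def __init__(self) -> None:
        self.lock = threading.Lock()        # stands in for fs_info->qgroup_lock
        self.tree: Dict[int, object] = {}   # stands in for the qgroup rbtree
        self.sysfs: Set[int] = set()        # stands in for the sysfs entries
        self.sysfs_del_under_lock = False   # set if cleanup ever runs locked

    def add(self, qgroupid: int) -> None:
        with self.lock:
            self.tree[qgroupid] = object()
        self.sysfs.add(qgroupid)

    def _sysfs_del(self, qgroupid: int) -> None:
        # In the kernel this step may sleep, so it must not run under a
        # spinlock. The model just records whether the lock was held.
        if self.lock.locked():
            self.sysfs_del_under_lock = True
        self.sysfs.discard(qgroupid)

    def remove(self, qgroupid: int) -> bool:
        # Step 1: the lock only protects the tree, so only unlink here.
        with self.lock:
            entry = self.tree.pop(qgroupid, None)
        # Step 2: sleeping cleanup happens after the lock is released.
        if entry is None:
            return False
        self._sysfs_del(qgroupid)
        return True


table = QgroupTable()
table.add(5)
removed = table.remove(5)
removed_again = table.remove(5)
```

Because the entry is already unlinked from the tree when the lock is dropped, no other lookup can find it while the sysfs cleanup runs.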
kuba-moo pushed a commit to linux-netdev/testing-bpf-ci that referenced this pull request Feb 24, 2025
The TX path may release the dmabuf in a context where we cannot wait.
This happens when the user unbinds a TX dmabuf while there are still
references to its netmems in the TX path. In that case, the netmems will
be put_netmem'd from a context where we can't unmap the dmabuf,
resulting in a BUG like the one seen by Stan:

[    1.548495] BUG: sleeping function called from invalid context at drivers/dma-buf/dma-buf.c:1255
[    1.548741] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 149, name: ncdevmem
[    1.548926] preempt_count: 201, expected: 0
[    1.549026] RCU nest depth: 0, expected: 0
[    1.549197]
[    1.549237] =============================
[    1.549331] [ BUG: Invalid wait context ]
[    1.549425] 6.13.0-rc3-00770-gbc9ef9606dc9-dirty #15 Tainted: G        W
[    1.549609] -----------------------------
[    1.549704] ncdevmem/149 is trying to lock:
[    1.549801] ffff8880066701c0 (reservation_ww_class_mutex){+.+.}-{4:4}, at: dma_buf_unmap_attachment_unlocked+0x4b/0x90
[    1.550051] other info that might help us debug this:
[    1.550167] context-{5:5}
[    1.550229] 3 locks held by ncdevmem/149:
[    1.550322]  #0: ffff888005730208 (&sb->s_type->i_mutex_key#11){+.+.}-{4:4}, at: sock_close+0x40/0xf0
[    1.550530]  #1: ffff88800b148f98 (sk_lock-AF_INET6){+.+.}-{0:0}, at: tcp_close+0x19/0x80
[    1.550731]  #2: ffff88800b148f18 (slock-AF_INET6){+.-.}-{3:3}, at: __tcp_close+0x185/0x4b0
[    1.550921] stack backtrace:
[    1.550990] CPU: 0 UID: 0 PID: 149 Comm: ncdevmem Tainted: G        W          6.13.0-rc3-00770-gbc9ef9606dc9-dirty #15
[    1.551233] Tainted: [W]=WARN
[    1.551304] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[    1.551518] Call Trace:
[    1.551584]  <TASK>
[    1.551636]  dump_stack_lvl+0x86/0xc0
[    1.551723]  __lock_acquire+0xb0f/0xc30
[    1.551814]  ? dma_buf_unmap_attachment_unlocked+0x4b/0x90
[    1.551941]  lock_acquire+0xf1/0x2a0
[    1.552026]  ? dma_buf_unmap_attachment_unlocked+0x4b/0x90
[    1.552152]  ? dma_buf_unmap_attachment_unlocked+0x4b/0x90
[    1.552281]  ? dma_buf_unmap_attachment_unlocked+0x4b/0x90
[    1.552408]  __ww_mutex_lock+0x121/0x1060
[    1.552503]  ? dma_buf_unmap_attachment_unlocked+0x4b/0x90
[    1.552648]  ww_mutex_lock+0x3d/0xa0
[    1.552733]  dma_buf_unmap_attachment_unlocked+0x4b/0x90
[    1.552857]  __net_devmem_dmabuf_binding_free+0x56/0xb0
[    1.552979]  skb_release_data+0x120/0x1f0
[    1.553074]  __kfree_skb+0x29/0xa0
[    1.553156]  tcp_write_queue_purge+0x41/0x310
[    1.553259]  tcp_v4_destroy_sock+0x127/0x320
[    1.553363]  ? __tcp_close+0x169/0x4b0
[    1.553452]  inet_csk_destroy_sock+0x53/0x130
[    1.553560]  __tcp_close+0x421/0x4b0
[    1.553646]  tcp_close+0x24/0x80
[    1.553724]  inet_release+0x5d/0x90
[    1.553806]  sock_close+0x4a/0xf0
[    1.553886]  __fput+0x9c/0x2b0
[    1.553960]  task_work_run+0x89/0xc0
[    1.554046]  do_exit+0x27f/0x980
[    1.554125]  do_group_exit+0xa4/0xb0
[    1.554211]  __x64_sys_exit_group+0x17/0x20
[    1.554309]  x64_sys_call+0x21a0/0x21a0
[    1.554400]  do_syscall_64+0xec/0x1d0
[    1.554487]  ? exc_page_fault+0x8a/0xf0
[    1.554585]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[    1.554703] RIP: 0033:0x7f2f8a27abcd

Resolve this by deferring __net_devmem_dmabuf_binding_free to a work
item via schedule_work(), so that it runs in a context that can sleep.

Suggested-by: Stanislav Fomichev <[email protected]>
Signed-off-by: Mina Almasry <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
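
Deferring the final release to a worker, rather than running it in whatever context drops the last reference, can be illustrated with a userspace analogue. This is a sketch under stated assumptions: `Binding`, `binding_put`, and the queue-based worker are hypothetical stand-ins for the binding refcount and the kernel workqueue, and the non-atomic refcount is only acceptable because the example drops references from a single thread.

```python
import queue
import threading
from typing import Callable, Optional


class Binding:
    def __init__(self) -> None:
        self.refs = 1
        self.freed_in: Optional[str] = None  # thread that ran the release

    def _free(self) -> None:
        # Stands in for the unmap step that may sleep.
        self.freed_in = threading.current_thread().name


work_queue: "queue.Queue[Optional[Callable[[], None]]]" = queue.Queue()


def worker() -> None:
    # Runs queued work items until it sees the None sentinel.
    while True:
        fn = work_queue.get()
        if fn is None:
            break
        fn()


def binding_put(binding: Binding) -> None:
    binding.refs -= 1
    if binding.refs == 0:
        # Do not release here: the caller may be in a context that cannot
        # wait. Queue the release for the worker instead.
        work_queue.put(binding._free)


kworker = threading.Thread(target=worker, name="kworker")
kworker.start()

binding = Binding()
binding_put(binding)     # last reference dropped by the current thread
work_queue.put(None)     # ask the worker to stop after draining the queue
kworker.join()
```

After `join()` returns, `binding.freed_in` is `"kworker"`: the release ran on the worker thread, not on the thread that dropped the last reference.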
kuba-moo pushed a commit to linux-netdev/testing-bpf-ci that referenced this pull request Feb 24, 2025
The TX path may release the dmabuf in a context where we cannot wait.
This happens when the user unbinds a TX dmabuf while there are still
references to its netmems in the TX path. In that case, the netmems will
be put_netmem'd from a context where we can't unmap the dmabuf,
resulting in a BUG like the one seen by Stan:

[    1.548495] BUG: sleeping function called from invalid context at drivers/dma-buf/dma-buf.c:1255
[    1.548741] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 149, name: ncdevmem
[    1.548926] preempt_count: 201, expected: 0
[    1.549026] RCU nest depth: 0, expected: 0
[    1.549197]
[    1.549237] =============================
[    1.549331] [ BUG: Invalid wait context ]
[    1.549425] 6.13.0-rc3-00770-gbc9ef9606dc9-dirty #15 Tainted: G        W
[    1.549609] -----------------------------
[    1.549704] ncdevmem/149 is trying to lock:
[    1.549801] ffff8880066701c0 (reservation_ww_class_mutex){+.+.}-{4:4}, at: dma_buf_unmap_attachment_unlocked+0x4b/0x90
[    1.550051] other info that might help us debug this:
[    1.550167] context-{5:5}
[    1.550229] 3 locks held by ncdevmem/149:
[    1.550322]  #0: ffff888005730208 (&sb->s_type->i_mutex_key#11){+.+.}-{4:4}, at: sock_close+0x40/0xf0
[    1.550530]  #1: ffff88800b148f98 (sk_lock-AF_INET6){+.+.}-{0:0}, at: tcp_close+0x19/0x80
[    1.550731]  #2: ffff88800b148f18 (slock-AF_INET6){+.-.}-{3:3}, at: __tcp_close+0x185/0x4b0
[    1.550921] stack backtrace:
[    1.550990] CPU: 0 UID: 0 PID: 149 Comm: ncdevmem Tainted: G        W          6.13.0-rc3-00770-gbc9ef9606dc9-dirty #15
[    1.551233] Tainted: [W]=WARN
[    1.551304] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[    1.551518] Call Trace:
[    1.551584]  <TASK>
[    1.551636]  dump_stack_lvl+0x86/0xc0
[    1.551723]  __lock_acquire+0xb0f/0xc30
[    1.551814]  ? dma_buf_unmap_attachment_unlocked+0x4b/0x90
[    1.551941]  lock_acquire+0xf1/0x2a0
[    1.552026]  ? dma_buf_unmap_attachment_unlocked+0x4b/0x90
[    1.552152]  ? dma_buf_unmap_attachment_unlocked+0x4b/0x90
[    1.552281]  ? dma_buf_unmap_attachment_unlocked+0x4b/0x90
[    1.552408]  __ww_mutex_lock+0x121/0x1060
[    1.552503]  ? dma_buf_unmap_attachment_unlocked+0x4b/0x90
[    1.552648]  ww_mutex_lock+0x3d/0xa0
[    1.552733]  dma_buf_unmap_attachment_unlocked+0x4b/0x90
[    1.552857]  __net_devmem_dmabuf_binding_free+0x56/0xb0
[    1.552979]  skb_release_data+0x120/0x1f0
[    1.553074]  __kfree_skb+0x29/0xa0
[    1.553156]  tcp_write_queue_purge+0x41/0x310
[    1.553259]  tcp_v4_destroy_sock+0x127/0x320
[    1.553363]  ? __tcp_close+0x169/0x4b0
[    1.553452]  inet_csk_destroy_sock+0x53/0x130
[    1.553560]  __tcp_close+0x421/0x4b0
[    1.553646]  tcp_close+0x24/0x80
[    1.553724]  inet_release+0x5d/0x90
[    1.553806]  sock_close+0x4a/0xf0
[    1.553886]  __fput+0x9c/0x2b0
[    1.553960]  task_work_run+0x89/0xc0
[    1.554046]  do_exit+0x27f/0x980
[    1.554125]  do_group_exit+0xa4/0xb0
[    1.554211]  __x64_sys_exit_group+0x17/0x20
[    1.554309]  x64_sys_call+0x21a0/0x21a0
[    1.554400]  do_syscall_64+0xec/0x1d0
[    1.554487]  ? exc_page_fault+0x8a/0xf0
[    1.554585]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[    1.554703] RIP: 0033:0x7f2f8a27abcd

Resolve this by deferring __net_devmem_dmabuf_binding_free() to a workqueue
with schedule_work().

Suggested-by: Stanislav Fomichev <[email protected]>
Signed-off-by: Mina Almasry <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
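
The deferral described in the commit message above can be illustrated with a small userspace sketch. This is not the kernel patch: every name below (work_item, binding, binding_put, run_pending_work) is hypothetical, the work list is single-threaded, and the in_atomic_ctx flag merely stands in for "a context where we cannot wait". The point it tries to show is that the last reference drop only queues the teardown, and the part that may sleep runs later.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names come from the kernel. */
struct work_item {
	void (*fn)(struct work_item *w);
	struct work_item *next;
};

struct binding {
	int refcount;
	bool unmapped;
	struct work_item unbind_w;	/* embedded, in the spirit of a work_struct */
};

static bool in_atomic_ctx;		/* models "sleeping is not allowed here" */
static struct work_item *pending;	/* single-threaded list of deferred work */

/* Queueing never sleeps, so it is safe from any context. */
static void queue_work_item(struct work_item *w)
{
	w->next = pending;
	pending = w;
}

/* Runs later, from a context that may sleep. Returns how many items ran. */
static int run_pending_work(void)
{
	int n = 0;

	while (pending) {
		struct work_item *w = pending;

		pending = w->next;
		w->fn(w);
		n++;
	}
	return n;
}

/* The part of the teardown that may sleep (stands in for the dma-buf unmap). */
static void binding_free_work(struct work_item *w)
{
	struct binding *b = (struct binding *)
		((char *)w - offsetof(struct binding, unbind_w));

	assert(!in_atomic_ctx);		/* the inline version would trip here */
	b->unmapped = true;
}

/* On the last put, defer the free instead of running it inline. */
static void binding_put(struct binding *b)
{
	if (--b->refcount == 0) {
		b->unbind_w.fn = binding_free_work;
		queue_work_item(&b->unbind_w);
	}
}
```

In this sketch, calling binding_put() while in_atomic_ctx is set leaves the binding mapped; the unmap happens only once run_pending_work() is called with the flag cleared.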
kuba-moo pushed a commit to linux-netdev/testing-bpf-ci that referenced this pull request Feb 25, 2025
The TX path may release the dmabuf in a context where we cannot wait.
This happens when the user unbinds a TX dmabuf while there are still
references to its netmems in the TX path. In that case, the netmems will
be put_netmem'd from a context where we can't unmap the dmabuf,
resulting in a BUG like the one seen by Stan:

[    1.548495] BUG: sleeping function called from invalid context at drivers/dma-buf/dma-buf.c:1255
[    1.548741] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 149, name: ncdevmem
[    1.548926] preempt_count: 201, expected: 0
[    1.549026] RCU nest depth: 0, expected: 0
[    1.549197]
[    1.549237] =============================
[    1.549331] [ BUG: Invalid wait context ]
[    1.549425] 6.13.0-rc3-00770-gbc9ef9606dc9-dirty #15 Tainted: G        W
[    1.549609] -----------------------------
[    1.549704] ncdevmem/149 is trying to lock:
[    1.549801] ffff8880066701c0 (reservation_ww_class_mutex){+.+.}-{4:4}, at: dma_buf_unmap_attachment_unlocked+0x4b/0x90
[    1.550051] other info that might help us debug this:
[    1.550167] context-{5:5}
[    1.550229] 3 locks held by ncdevmem/149:
[    1.550322]  #0: ffff888005730208 (&sb->s_type->i_mutex_key#11){+.+.}-{4:4}, at: sock_close+0x40/0xf0
[    1.550530]  #1: ffff88800b148f98 (sk_lock-AF_INET6){+.+.}-{0:0}, at: tcp_close+0x19/0x80
[    1.550731]  #2: ffff88800b148f18 (slock-AF_INET6){+.-.}-{3:3}, at: __tcp_close+0x185/0x4b0
[    1.550921] stack backtrace:
[    1.550990] CPU: 0 UID: 0 PID: 149 Comm: ncdevmem Tainted: G        W          6.13.0-rc3-00770-gbc9ef9606dc9-dirty #15
[    1.551233] Tainted: [W]=WARN
[    1.551304] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[    1.551518] Call Trace:
[    1.551584]  <TASK>
[    1.551636]  dump_stack_lvl+0x86/0xc0
[    1.551723]  __lock_acquire+0xb0f/0xc30
[    1.551814]  ? dma_buf_unmap_attachment_unlocked+0x4b/0x90
[    1.551941]  lock_acquire+0xf1/0x2a0
[    1.552026]  ? dma_buf_unmap_attachment_unlocked+0x4b/0x90
[    1.552152]  ? dma_buf_unmap_attachment_unlocked+0x4b/0x90
[    1.552281]  ? dma_buf_unmap_attachment_unlocked+0x4b/0x90
[    1.552408]  __ww_mutex_lock+0x121/0x1060
[    1.552503]  ? dma_buf_unmap_attachment_unlocked+0x4b/0x90
[    1.552648]  ww_mutex_lock+0x3d/0xa0
[    1.552733]  dma_buf_unmap_attachment_unlocked+0x4b/0x90
[    1.552857]  __net_devmem_dmabuf_binding_free+0x56/0xb0
[    1.552979]  skb_release_data+0x120/0x1f0
[    1.553074]  __kfree_skb+0x29/0xa0
[    1.553156]  tcp_write_queue_purge+0x41/0x310
[    1.553259]  tcp_v4_destroy_sock+0x127/0x320
[    1.553363]  ? __tcp_close+0x169/0x4b0
[    1.553452]  inet_csk_destroy_sock+0x53/0x130
[    1.553560]  __tcp_close+0x421/0x4b0
[    1.553646]  tcp_close+0x24/0x80
[    1.553724]  inet_release+0x5d/0x90
[    1.553806]  sock_close+0x4a/0xf0
[    1.553886]  __fput+0x9c/0x2b0
[    1.553960]  task_work_run+0x89/0xc0
[    1.554046]  do_exit+0x27f/0x980
[    1.554125]  do_group_exit+0xa4/0xb0
[    1.554211]  __x64_sys_exit_group+0x17/0x20
[    1.554309]  x64_sys_call+0x21a0/0x21a0
[    1.554400]  do_syscall_64+0xec/0x1d0
[    1.554487]  ? exc_page_fault+0x8a/0xf0
[    1.554585]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[    1.554703] RIP: 0033:0x7f2f8a27abcd

Resolve this by deferring __net_devmem_dmabuf_binding_free() to a workqueue
with schedule_work().

Suggested-by: Stanislav Fomichev <[email protected]>
Signed-off-by: Mina Almasry <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
kernel-patches-daemon-bpf bot pushed a commit that referenced this pull request Apr 1, 2025
Ian told me that there are many memory leaks in the hierarchy mode.  I
can easily reproduce them with the following commands.

  $ make DEBUG=1 EXTRA_CFLAGS=-fsanitize=leak

  $ perf record --latency -g -- ./perf test -w thloop

  $ perf report -H --stdio
  ...
  Indirect leak of 168 byte(s) in 21 object(s) allocated from:
      #0 0x7f3414c16c65 in malloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:75
      #1 0x55ed3602346e in map__get util/map.h:189
      #2 0x55ed36024cc4 in hist_entry__init util/hist.c:476
      #3 0x55ed36025208 in hist_entry__new util/hist.c:588
      #4 0x55ed36027c05 in hierarchy_insert_entry util/hist.c:1587
      #5 0x55ed36027e2e in hists__hierarchy_insert_entry util/hist.c:1638
      #6 0x55ed36027fa4 in hists__collapse_insert_entry util/hist.c:1685
      #7 0x55ed360283e8 in hists__collapse_resort util/hist.c:1776
      #8 0x55ed35de0323 in report__collapse_hists /home/namhyung/project/linux/tools/perf/builtin-report.c:735
      #9 0x55ed35de15b4 in __cmd_report /home/namhyung/project/linux/tools/perf/builtin-report.c:1119
      #10 0x55ed35de43dc in cmd_report /home/namhyung/project/linux/tools/perf/builtin-report.c:1867
      #11 0x55ed35e66767 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:351
      #12 0x55ed35e66a0e in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:404
      #13 0x55ed35e66b67 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:448
      #14 0x55ed35e66eb0 in main /home/namhyung/project/linux/tools/perf/perf.c:556
      #15 0x7f340ac33d67 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
  ...

  $ perf report -H --stdio 2>&1 | grep -c '^Indirect leak'
  93

I found that hist_entry__delete() failed to release the child entries in
the hierarchy tree (hroot_{in,out}).  It needs to iterate over the child
entries and call hist_entry__delete() recursively.

After this change:

  $ perf report -H --stdio 2>&1 | grep -c '^Indirect leak'
  0
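
The recursive cleanup described above can be sketched in isolation. This is only an illustration under simplifying assumptions: perf keeps the children in rb-trees (hroot_{in,out}), while the hypothetical struct entry below uses a plain sibling list, and the live_entries counter exists only to make leaks visible.

```c
#include <stdlib.h>

/* Hypothetical simplified entry; perf stores children in rb-trees instead. */
struct entry {
	struct entry *first_child;
	struct entry *next_sibling;
};

static int live_entries;	/* counts allocations that were not freed yet */

/* Allocate an entry and link it under parent (NULL for a root entry). */
static struct entry *entry_new(struct entry *parent)
{
	struct entry *e = calloc(1, sizeof(*e));

	if (!e)
		return NULL;
	if (parent) {
		e->next_sibling = parent->first_child;
		parent->first_child = e;
	}
	live_entries++;
	return e;
}

/* Free an entry together with every descendant below it. */
static void entry_delete(struct entry *e)
{
	struct entry *child = e->first_child;

	while (child) {
		/* Save the sibling first: the child is freed by the call below. */
		struct entry *next = child->next_sibling;

		entry_delete(child);
		child = next;
	}
	free(e);
	live_entries--;
}
```

Without the loop over the children, deleting a root would free only that one entry and leave its whole subtree allocated, which is the kind of indirect leak the sanitizer reported.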

Reported-by: Ian Rogers <[email protected]>
Tested-by: Thomas Falcon <[email protected]>
Reviewed-by: Ian Rogers <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Namhyung Kim <[email protected]>
kuba-moo pushed a commit to linux-netdev/testing-bpf-ci that referenced this pull request Apr 1, 2025
When creating an ipip6 tunnel, if tunnel->parms.link points to a previously
created tunnel device, dev->needed_headroom grows on top of the value of
that previous device.

If enough tunnel devices are stacked this way, needed_headroom can
overflow. The overflow happens like this:

  ipip6_newlink
    ipip6_tunnel_create
      register_netdevice
        ipip6_tunnel_init
          ipip6_tunnel_bind_dev
            t_hlen = tunnel->hlen + sizeof(struct iphdr); // 40
            hlen = tdev->hard_header_len + tdev->needed_headroom; // 65496
            dev->needed_headroom = t_hlen + hlen; // 65536 -> 0
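
The arithmetic in the last three lines of the call chain can be checked with a small standalone sketch. It assumes the headroom field is 16 bits wide, which is what the "65536 -> 0" comment implies; the helper name stacked_headroom is hypothetical and does not exist in the kernel.

```c
#include <stdint.h>

/*
 * Hypothetical model of the assignment to needed_headroom: the sum is
 * computed in int and then truncated to a 16-bit field, so any value
 * of 65536 or more wraps around.
 */
static uint16_t stacked_headroom(uint16_t t_hlen, uint16_t lower_hlen)
{
	return (uint16_t)(t_hlen + lower_hlen);
}
```

With the values from the trace above, stacked_headroom(40, 65496) yields 0, while one byte less of lower headroom still fits and yields 65535.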

The value of LL_RESERVED_SPACE(rt->dst.dev) may then be as small as
HH_DATA_MOD, which leads to a small skb being allocated in
__ip_append_data() and triggers a skb_under_panic:

 ------------[ cut here ]------------
 kernel BUG at net/core/skbuff.c:209!
 Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
 CPU: 0 UID: 0 PID: 23587 Comm: test Tainted: G        W          6.14.0-00624-g2f2d52945852-dirty #15
 Tainted: [W]=WARN
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
 RIP: 0010:skb_panic (net/core/skbuff.c:209 (discriminator 4))
 Call Trace:
  <TASK>
  skb_push (net/core/skbuff.c:2544)
  fou_build_udp (net/ipv4/fou_core.c:1041)
  gue_build_header (net/ipv4/fou_core.c:1085)
  ip_tunnel_xmit (net/ipv4/ip_tunnel.c:780)
  sit_tunnel_xmit__.isra.0 (net/ipv6/sit.c:1065)
  sit_tunnel_xmit (net/ipv6/sit.c:1076)
  dev_hard_start_xmit (net/core/dev.c:3816)
  __dev_queue_xmit (net/core/dev.c:4653)
  neigh_connected_output (net/core/neighbour.c:1543)
  ip_finish_output2 (net/ipv4/ip_output.c:236)
  __ip_finish_output (net/ipv4/ip_output.c:314)
  ip_finish_output (net/ipv4/ip_output.c:324)
  ip_mc_output (net/ipv4/ip_output.c:421)
  ip_send_skb (net/ipv4/ip_output.c:1502)
  udp_send_skb (net/ipv4/udp.c:1197)
  udp_sendmsg (net/ipv4/udp.c:1484)
  udpv6_sendmsg (net/ipv6/udp.c:1545)
  inet6_sendmsg (net/ipv6/af_inet6.c:659)
  ____sys_sendmsg (net/socket.c:2573)
  ___sys_sendmsg (net/socket.c:2629)
  __sys_sendmmsg (net/socket.c:2719)
  __x64_sys_sendmmsg (net/socket.c:2740)
  do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
  </TASK>
 ---[ end trace 0000000000000000 ]---

Fix this by adding a check for needed_headroom in ipip6_tunnel_bind_dev().
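
The wrap-around in the trace above (40 + 65496 = 65536 -> 0) can be reproduced in a few lines of userspace C. This is a minimal sketch, assuming a 16-bit headroom field as the numbers imply; `headroom_unchecked()` and `headroom_checked()` are hypothetical helpers and do not reproduce the actual patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Models the unchecked assignment: the sum is truncated to 16 bits. */
static uint16_t headroom_unchecked(unsigned int t_hlen, unsigned int hlen)
{
	return (uint16_t)(t_hlen + hlen);
}

/* Hypothetical guard: accept the sum only if it fits into 16 bits.
 * Returns false and leaves *out untouched when it would overflow. */
static bool headroom_checked(unsigned int t_hlen, unsigned int hlen,
			     uint16_t *out)
{
	unsigned int sum = t_hlen + hlen;

	if (sum > UINT16_MAX)
		return false;
	*out = (uint16_t)sum;
	return true;
}
```

With t_hlen = 40 and hlen = 65496, headroom_unchecked() returns 0, while headroom_checked() rejects the update and keeps the previous value.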

Reported-by: [email protected]
Closes: https://syzkaller.appspot.com/bug?extid=4c63f36709a642f801c5
Fixes: c88f8d5 ("sit: update dev->needed_headroom in ipip6_tunnel_bind_dev()")
Signed-off-by: Wang Liang <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
kernel-patches-daemon-bpf bot pushed a commit that referenced this pull request Apr 1, 2025
…ge_order()

Patch series "mm: MM owner tracking for large folios (!hugetlb) +
CONFIG_NO_PAGE_MAPCOUNT", v3.

Let's add an "easy" way to decide -- without false positives, without
page-mapcounts and without page table/rmap scanning -- whether a large
folio is "certainly mapped exclusively" into a single MM, or whether it
"maybe mapped shared" into multiple MMs.

Use that information to implement Copy-on-Write reuse, to convert
folio_likely_mapped_shared() to folio_maybe_mapped_shared(), and to
introduce a kernel config option that lets us not use+maintain per-page
mapcounts in large folios anymore.

The bigger picture was presented at LSF/MM [1].

This series is effectively a follow-up on my early work [2], which
implemented a more precise, but also more complicated, way to identify
whether a large folio is "mapped shared" into multiple MMs or "mapped
exclusively" into a single MM.


1 Patch Organization
====================

Patch #1 -> #6: make more room in order-1 folios, so we have two
                "unsigned long" available for our purposes

Patch #7 -> #11: preparations

Patch #12: MM owner tracking for large folios

Patch #13: COW reuse for PTE-mapped anon THP

Patch #14: folio_maybe_mapped_shared()

Patch #15 -> #20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT


2 MM owner tracking
===================

We assign each MM a unique ID ("MM ID"), to be able to squeeze more
information in our folios.  On 32bit we use 15-bit IDs, on 64bit we use
31-bit IDs.

For each large folio, we now store two MM-ID+mapcount ("slot")
combinations:
* mm0_id + mm0_mapcount
* mm1_id + mm1_mapcount

On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit
mapcount.  This way, we require 2x "unsigned long" on 32bit and 64bit for
both slots.

Paired with the large mapcount, we can reliably identify whether one of
these MMs is the current owner (-> owns all mappings) or even holds all
folio references (-> owns all mappings, and all references are from
mappings).

As long as only two MMs map folio pages at a time, we can reliably and
precisely identify whether a large folio is "mapped shared" or "mapped
exclusively".

Any additional MM that starts mapping the folio while there are no free
slots becomes an "untracked MM".  If one such "untracked MM" is the last
one mapping a folio exclusively, we will not detect the folio as "mapped
exclusively" but instead as "maybe mapped shared".  (exception: only a
single mapping remains)

So that's where the approach gets imprecise.
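
The two-slot scheme can be illustrated with a small userspace model. This is a sketch under loud assumptions, not the kernel implementation: all `toy_*` names are hypothetical, MM ID 0 is assumed to mark a free slot, there is no locking, and the shared/exclusive answer is recomputed on demand instead of being cached.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NO_MM 0u  /* hypothetical: ID 0 marks a free slot */

struct toy_slot {
	unsigned int mm_id;
	int mapcount;             /* mappings accounted to this MM */
};

struct toy_folio {
	int large_mapcount;       /* total number of mappings of the folio */
	struct toy_slot slot[2];  /* mm0 and mm1 */
};

static struct toy_slot *toy_find_slot(struct toy_folio *f, unsigned int mm_id)
{
	for (int i = 0; i < 2; i++)
		if (f->slot[i].mm_id == mm_id)
			return &f->slot[i];
	return NULL;
}

/* Account one new mapping from mm_id (must not be TOY_NO_MM). */
static void toy_map(struct toy_folio *f, unsigned int mm_id)
{
	struct toy_slot *s = toy_find_slot(f, mm_id);

	if (!s) {
		s = toy_find_slot(f, TOY_NO_MM);  /* try to grab a free slot */
		if (s)
			s->mm_id = mm_id;
	}
	if (s)
		s->mapcount++;    /* otherwise: an "untracked MM" */
	f->large_mapcount++;
}

/* Drop one mapping from mm_id; free the slot when its count hits 0. */
static void toy_unmap(struct toy_folio *f, unsigned int mm_id)
{
	struct toy_slot *s = toy_find_slot(f, mm_id);

	if (s && --s->mapcount == 0)
		s->mm_id = TOY_NO_MM;
	f->large_mapcount--;
}

/* "Certainly exclusive" if at most one mapping remains or one tracked
 * MM owns all mappings; everything else is "maybe shared". */
static bool toy_maybe_mapped_shared(const struct toy_folio *f)
{
	if (f->large_mapcount <= 1)
		return false;
	for (int i = 0; i < 2; i++)
		if (f->slot[i].mm_id != TOY_NO_MM &&
		    f->slot[i].mapcount == f->large_mapcount)
			return false;
	return true;
}
```

In this model, a third MM that maps the folio while both slots are taken stays untracked. If it ends up as the only remaining mapper with more than one mapping, the folio is still reported as "maybe mapped shared", which mirrors the imprecision described above.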

For now, we use a bit-spinlock to sync the large mapcount + slots, and
make sure we do keep the machinery fast, to not degrade (un)map
performance drastically: for example, we make sure to only use a single
atomic (when grabbing the bit-spinlock), like we would already perform
when updating the large mapcount.


3 CONFIG_NO_PAGE_MAPCOUNT
=========================

patch #15 -> #20 spell out and document what exactly is affected when not
maintaining the per-page mapcounts in large folios anymore.

Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore
when (un)mapping pages, we'll account a complete folio as mapped if a
single page is mapped.  In addition, we'll not detect partially mapped
anonymous folios as such in all cases yet.

Likely less relevant changes include that we might now under-estimate the
USS (Unique Set Size) of a process, but never over-estimate it.

The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point, to
then slowly make it the only option, as we learn about real-life impacts
and possible ways to mitigate them.


4 Performance
=============

Detailed performance numbers were included in v1 [3], and not that much
changed between v1 and v2.

I did plenty of measurements on different systems in the meantime, that
all revealed slightly different results.

The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code
layout changes on some systems.  Especially the fork() benchmark started
being more-shaky-than-before on recent kernels for some reason.

In summary, with my micro-benchmarks:

* Small folios are not impacted.

* CoW performance seems to be mostly unchanged across all folio sizes.

* CoW reuse performance of large folios now matches CoW reuse
  performance of small folios, because we now actually implement the CoW
  reuse optimization.  On an Intel Xeon Silver 4210R I measured a ~65%
  reduction in runtime, on an arm64 system I measured ~54% reduction.

* munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT.  I saw
  double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and
  up to ~70% on an AmpereOne A192-32X) with larger folios.  The larger the
  folios, the larger the performance improvement.

* munmap() performance very slightly (couple percent) degrades without
  CONFIG_NO_PAGE_MAPCOUNT for smaller folios.  For larger folios, there
  seems to be no change at all.

* fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT.  I saw
  double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and
  up to ~10% on an AmpereOne A192-32X) with larger folios.  The larger the
  folios, the larger the performance improvement.

* While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be
  almost unchanged on some systems, I saw some degradation for smaller
  folios on the AmpereOne A192-32X.  I did not investigate the details
  yet, but I suspect code layout changes or suboptimal code placement /
  inlining.

I'm not too worried about the fork() micro-benchmarks for smaller folios
given how shaky the results are lately and by how much we improved fork()
performance recently.

I also ran case-anon-cow-rand and case-anon-cow-seq part of
vm-scalability, to assess the scalability and the impact of the
bit-spinlock.  My measurements on a 2-socket 10-core Intel Xeon Silver
4210R system revealed no significant changes.

Similarly, running these benchmarks with 2 MiB THPs enabled on the
AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev,
which is nice.

So far, I did not get my hands on a similarly large system with multiple
sockets.

I found no other fitting scalability benchmarks that seem to really hammer
on concurrent mapping/unmapping of large folio pages like
case-anon-cow-seq does.


5 Concerns
==========

5.1 Bit spinlock
----------------

I'm not quite happy about the bit-spinlock, but so far it does not seem to
affect scalability in my measurements.

If it ever becomes a problem we could either investigate improving the
locking, or simply stopping the MM tracking once there are "too many
mappings" and simply assume that the folio is "mapped shared" until it was
freed.

This would be similar (but slightly different) to the "0,1,2,stopped"
counting idea Willy had at some point.  Adding that logic to "stop
tracking" adds more code to the hot path, so I avoided that for now.


5.2 folio_maybe_mapped_shared()
-------------------------------

I documented the change from folio_likely_mapped_shared() to
folio_maybe_mapped_shared() quite extensively.  If we run into surprises,
I have some ideas on how to resolve them.  For now, I think we should be
fine.


5.3 Added code to map/unmap hot path
------------------------------------

So far, it looks like the added code on the rmap hot path does not really
seem to matter much in the bigger picture.  I'd like to further reduce it
(and possibly improve fork() performance further), but I don't easily see
how right now.  Well, and I am out of puff 🙂

Having said that, alternatives I considered (e.g., per-MM per-folio
mapcount) would add a lot more overhead to these hot paths.


6 Future Work
=============

6.1 Large mapcount
------------------

It would be very handy if the large mapcount would count how often folio
pages are actually mapped into page tables: a PMD on x86-64 would count
512 times.  Calculating the average per-page mapcount will be easy, and
remapping (PMD->PTE) folios would get even faster.
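
The arithmetic behind this idea is simple enough to sketch. The helpers below are hypothetical and only model the proposed accounting (one count per page for a whole-folio mapping), which the series itself does not implement.

```c
/* Number of base pages in a folio of the given order. */
static unsigned long toy_nr_pages(unsigned int order)
{
	return 1UL << order;
}

/* Hypothetical accounting: mapping the whole folio at once (for
 * example an order-9 folio via a PMD on x86-64) adds one count per
 * page, so 512 in that case. */
static unsigned long toy_account_full_mapping(unsigned long mapcount,
					      unsigned int order)
{
	return mapcount + toy_nr_pages(order);
}

/* Average per-page mapcount, rounded down. */
static unsigned long toy_average_mapcount(unsigned long mapcount,
					  unsigned int order)
{
	return mapcount >> order;
}
```

Under this model the average is a single shift, and a PMD->PTE remap would not need to touch the count at all, because 512 PTE mappings account for exactly the same total.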

That would also remove the need for the entire mapcount (except for
PMD-sized folios for memory statistics reasons ...), and allow for mapping
folios larger than PMDs (e.g., 4 MiB) easily.

We likely would also have to take the same number of folio references to
make our folio_mapcount() == folio_ref_count() work, and we'd want to be
able to avoid mapcount+refcount overflows: this could already become an
issue with pte-mapped PUD-sized folios (fsdax).

One approach we discussed in the THP cabal meeting is (1) extending the
mapcount for large folios to 64bit (at least on 64bit systems) and (2)
keeping the refcount at 32bit, but (3) having exactly one reference if the
mapcount != 0.

It should be doable, but there are some corner cases to consider on the
unmap path; it is something that I will be looking into next.


6.2 hugetlb
-----------

I'd love to make use of the same tracking also for hugetlb.

The real problem is PMD table sharing: getting a page mapped by MM X and
unmapped by MM Y will not work.  With mshare, that problem should not
exist (all mapping/unmapping will be routed through the mshare MM).

[1] https://lwn.net/Articles/974223/
[2] https://lore.kernel.org/linux-mm/[email protected]/T/
[3] https://lkml.kernel.org/r/[email protected]
[4] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c


This patch (of 20):

Let's factor it out into a simple helper function.  This helper will also
come in handy when working with code where we know that our folio is
large.

Maybe in the future we'll have the order readily available for small and
large folios; in that case, folio_large_order() would simply translate to
folio_order().
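
The relationship between the two helpers can be sketched as follows. This is an illustration only: `struct toy_folio_hdr` and the `toy_*` functions are hypothetical, and the real struct folio encodes the "large" flag and the order differently.

```c
#include <stdbool.h>

/* Hypothetical toy folio header, not the kernel's struct folio. */
struct toy_folio_hdr {
	bool large;          /* is this a large (multi-page) folio? */
	unsigned int order;  /* only meaningful when large is true */
};

/* For callers that already know the folio is large: no check needed. */
static inline unsigned int toy_folio_large_order(const struct toy_folio_hdr *f)
{
	return f->order;
}

/* Generic variant: small folios always have order 0. */
static inline unsigned int toy_folio_order(const struct toy_folio_hdr *f)
{
	if (!f->large)
		return 0;
	return toy_folio_large_order(f);
}
```

If the order ever becomes valid for small folios as well, the check in the generic variant disappears and both helpers collapse into one, which is the future the commit message hints at.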

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Lance Yang <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Andy Lutomirks^H^Hski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: tejun heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
kuba-moo pushed a commit to linux-netdev/testing-bpf-ci that referenced this pull request Apr 2, 2025
When creating an ipip6 tunnel, if tunnel->parms.link is assigned to a
previously created tunnel device, dev->needed_headroom grows on top of
the previous device's value.

If enough tunnel devices are stacked this way, needed_headroom can
overflow. The overflow happens like this:

  ipip6_newlink
    ipip6_tunnel_create
      register_netdevice
        ipip6_tunnel_init
          ipip6_tunnel_bind_dev
            t_hlen = tunnel->hlen + sizeof(struct iphdr); // 40
            hlen = tdev->hard_header_len + tdev->needed_headroom; // 65496
            dev->needed_headroom = t_hlen + hlen; // 65536 -> 0

The value of LL_RESERVED_SPACE(rt->dst.dev) may then be HH_DATA_MOD, which
leads to a small skb being allocated in __ip_append_data() and triggers
skb_under_panic:

 ------------[ cut here ]------------
 kernel BUG at net/core/skbuff.c:209!
 Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
 CPU: 0 UID: 0 PID: 23587 Comm: test Tainted: G        W          6.14.0-00624-g2f2d52945852-dirty #15
 Tainted: [W]=WARN
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
 RIP: 0010:skb_panic (net/core/skbuff.c:209 (discriminator 4))
 Call Trace:
  <TASK>
  skb_push (net/core/skbuff.c:2544)
  fou_build_udp (net/ipv4/fou_core.c:1041)
  gue_build_header (net/ipv4/fou_core.c:1085)
  ip_tunnel_xmit (net/ipv4/ip_tunnel.c:780)
  sit_tunnel_xmit__.isra.0 (net/ipv6/sit.c:1065)
  sit_tunnel_xmit (net/ipv6/sit.c:1076)
  dev_hard_start_xmit (net/core/dev.c:3816)
  __dev_queue_xmit (net/core/dev.c:4653)
  neigh_connected_output (net/core/neighbour.c:1543)
  ip_finish_output2 (net/ipv4/ip_output.c:236)
  __ip_finish_output (net/ipv4/ip_output.c:314)
  ip_finish_output (net/ipv4/ip_output.c:324)
  ip_mc_output (net/ipv4/ip_output.c:421)
  ip_send_skb (net/ipv4/ip_output.c:1502)
  udp_send_skb (net/ipv4/udp.c:1197)
  udp_sendmsg (net/ipv4/udp.c:1484)
  udpv6_sendmsg (net/ipv6/udp.c:1545)
  inet6_sendmsg (net/ipv6/af_inet6.c:659)
  ____sys_sendmsg (net/socket.c:2573)
  ___sys_sendmsg (net/socket.c:2629)
  __sys_sendmmsg (net/socket.c:2719)
  __x64_sys_sendmmsg (net/socket.c:2740)
  do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
  </TASK>
 ---[ end trace 0000000000000000 ]---

Fix this by adding a check for needed_headroom in ipip6_tunnel_bind_dev().
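
As a rough illustration of the wraparound described above (not the actual patch), the following user-space C sketch models needed_headroom as a 16-bit unsigned value, matching the `65536 -> 0` step in the call chain. The helper names `headroom_wrapping_add` and `headroom_checked_add` are hypothetical, and rejecting an oversized sum is only one possible form such a check could take.

```c
#include <stdbool.h>
#include <stdint.h>

/* Models the unchecked assignment: the sum is truncated to 16 bits. */
static uint16_t headroom_wrapping_add(uint16_t t_hlen, uint16_t hlen)
{
    return (uint16_t)((uint32_t)t_hlen + (uint32_t)hlen);
}

/* Hypothetical checked variant: refuses sums that do not fit in 16 bits. */
static bool headroom_checked_add(uint16_t t_hlen, uint16_t hlen, uint16_t *out)
{
    uint32_t sum = (uint32_t)t_hlen + (uint32_t)hlen;

    if (sum > UINT16_MAX)
        return false;   /* would wrap; *out is left untouched */
    *out = (uint16_t)sum;
    return true;
}
```

With the values from the trace, `headroom_wrapping_add(40, 65496)` yields 0, while `headroom_checked_add(40, 65496, &out)` reports failure and leaves `out` unchanged.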

Reported-by: [email protected]
Closes: https://syzkaller.appspot.com/bug?extid=4c63f36709a642f801c5
Fixes: c88f8d5 ("sit: update dev->needed_headroom in ipip6_tunnel_bind_dev()")
Signed-off-by: Wang Liang <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
kuba-moo pushed a commit to linux-netdev/testing-bpf-ci that referenced this pull request Apr 3, 2025
When creating an ipip6 tunnel, if tunnel->parms.link is assigned to the
previously created tunnel device, dev->needed_headroom will increase based
on the previous device's value.

If enough tunnel devices are chained this way, needed_headroom can
overflow. The overflow happens like this:

  ipip6_newlink
    ipip6_tunnel_create
      register_netdevice
        ipip6_tunnel_init
          ipip6_tunnel_bind_dev
            t_hlen = tunnel->hlen + sizeof(struct iphdr); // 40
            hlen = tdev->hard_header_len + tdev->needed_headroom; // 65496
            dev->needed_headroom = t_hlen + hlen; // 65536 -> 0

The value of LL_RESERVED_SPACE(rt->dst.dev) may then be HH_DATA_MOD, which
leads to a small skb being allocated in __ip_append_data() and triggers
skb_under_panic:

 ------------[ cut here ]------------
 kernel BUG at net/core/skbuff.c:209!
 Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
 CPU: 0 UID: 0 PID: 23587 Comm: test Tainted: G        W          6.14.0-00624-g2f2d52945852-dirty #15
 Tainted: [W]=WARN
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
 RIP: 0010:skb_panic (net/core/skbuff.c:209 (discriminator 4))
 Call Trace:
  <TASK>
  skb_push (net/core/skbuff.c:2544)
  fou_build_udp (net/ipv4/fou_core.c:1041)
  gue_build_header (net/ipv4/fou_core.c:1085)
  ip_tunnel_xmit (net/ipv4/ip_tunnel.c:780)
  sit_tunnel_xmit__.isra.0 (net/ipv6/sit.c:1065)
  sit_tunnel_xmit (net/ipv6/sit.c:1076)
  dev_hard_start_xmit (net/core/dev.c:3816)
  __dev_queue_xmit (net/core/dev.c:4653)
  neigh_connected_output (net/core/neighbour.c:1543)
  ip_finish_output2 (net/ipv4/ip_output.c:236)
  __ip_finish_output (net/ipv4/ip_output.c:314)
  ip_finish_output (net/ipv4/ip_output.c:324)
  ip_mc_output (net/ipv4/ip_output.c:421)
  ip_send_skb (net/ipv4/ip_output.c:1502)
  udp_send_skb (net/ipv4/udp.c:1197)
  udp_sendmsg (net/ipv4/udp.c:1484)
  udpv6_sendmsg (net/ipv6/udp.c:1545)
  inet6_sendmsg (net/ipv6/af_inet6.c:659)
  ____sys_sendmsg (net/socket.c:2573)
  ___sys_sendmsg (net/socket.c:2629)
  __sys_sendmmsg (net/socket.c:2719)
  __x64_sys_sendmmsg (net/socket.c:2740)
  do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
  </TASK>
 ---[ end trace 0000000000000000 ]---

Fix this by adding a check for needed_headroom in ipip6_tunnel_bind_dev().

Reported-by: [email protected]
Closes: https://syzkaller.appspot.com/bug?extid=4c63f36709a642f801c5
Fixes: c88f8d5 ("sit: update dev->needed_headroom in ipip6_tunnel_bind_dev()")
Signed-off-by: Wang Liang <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
kernel-patches-daemon-bpf bot pushed a commit that referenced this pull request Apr 3, 2025
When a bio with REQ_PREFLUSH is submitted to dm, __send_empty_flush()
generates a flush_bio with REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC,
which causes the flush_bio to be throttled by wbt_wait().

An example from v5.4 (a similar problem also exists upstream):

    crash> bt 2091206
    PID: 2091206  TASK: ffff2050df92a300  CPU: 109  COMMAND: "kworker/u260:0"
     #0 [ffff800084a2f7f0] __switch_to at ffff80004008aeb8
     #1 [ffff800084a2f820] __schedule at ffff800040bfa0c4
     #2 [ffff800084a2f880] schedule at ffff800040bfa4b4
     #3 [ffff800084a2f8a0] io_schedule at ffff800040bfa9c4
     #4 [ffff800084a2f8c0] rq_qos_wait at ffff8000405925bc
     #5 [ffff800084a2f940] wbt_wait at ffff8000405bb3a0
     #6 [ffff800084a2f9a0] __rq_qos_throttle at ffff800040592254
     #7 [ffff800084a2f9c0] blk_mq_make_request at ffff80004057cf38
     #8 [ffff800084a2fa60] generic_make_request at ffff800040570138
     #9 [ffff800084a2fae0] submit_bio at ffff8000405703b4
    #10 [ffff800084a2fb50] xlog_write_iclog at ffff800001280834 [xfs]
    #11 [ffff800084a2fbb0] xlog_sync at ffff800001280c3c [xfs]
    #12 [ffff800084a2fbf0] xlog_state_release_iclog at ffff800001280df4 [xfs]
    #13 [ffff800084a2fc10] xlog_write at ffff80000128203c [xfs]
    #14 [ffff800084a2fcd0] xlog_cil_push at ffff8000012846dc [xfs]
    #15 [ffff800084a2fda0] xlog_cil_push_work at ffff800001284a2c [xfs]
    #16 [ffff800084a2fdb0] process_one_work at ffff800040111d08
    #17 [ffff800084a2fe00] worker_thread at ffff8000401121cc
    #18 [ffff800084a2fe70] kthread at ffff800040118de4

After commit 2def284 ("xfs: don't allow log IO to be throttled"),
the metadata submitted by xlog_write_iclog() should not be throttled.
But due to the existence of the dm layer, throttling flush_bio indirectly
causes the metadata bio to be throttled.

Fix this by conditionally adding REQ_IDLE to flush_bio.bi_opf, which makes
wbt_should_throttle() return false to avoid wbt_wait().
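
The effect of the flag change can be modelled in user space as follows. This is a simplified sketch of the kind of check wbt_should_throttle() performs, not the kernel source: the enum, the flag bit values, and the function name `model_should_throttle` are illustrative assumptions.

```c
#include <stdbool.h>

/* Illustrative values only; the kernel's real encodings differ. */
enum model_op { MODEL_OP_READ, MODEL_OP_WRITE, MODEL_OP_DISCARD };

#define MODEL_REQ_SYNC     (1u << 0)
#define MODEL_REQ_IDLE     (1u << 1)
#define MODEL_REQ_PREFLUSH (1u << 2)

/* Assumed rule: writes carrying both SYNC and IDLE bypass throttling. */
static bool model_should_throttle(enum model_op op, unsigned int flags)
{
    switch (op) {
    case MODEL_OP_WRITE:
        if ((flags & (MODEL_REQ_SYNC | MODEL_REQ_IDLE)) ==
            (MODEL_REQ_SYNC | MODEL_REQ_IDLE))
            return false;
        return true;
    case MODEL_OP_DISCARD:
        return true;
    default:
        return false;
    }
}
```

Under this model, a flush bio with WRITE | PREFLUSH | SYNC is throttled, and the same bio with IDLE added is not.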

Signed-off-by: Jinliang Zheng <[email protected]>
Reviewed-by: Tianxiang Peng <[email protected]>
Reviewed-by: Hao Peng <[email protected]>
Signed-off-by: Mikulas Patocka <[email protected]>
kuba-moo pushed a commit to linux-netdev/testing-bpf-ci that referenced this pull request Apr 22, 2025
Ido Schimmel says:

====================
vxlan: Convert FDB table to rhashtable

The VXLAN driver currently stores FDB entries in a hash table with a
fixed number of buckets (256), resulting in reduced performance as the
number of entries grows. This patchset solves the issue by converting
the driver to use rhashtable which maintains a more or less constant
performance regardless of the number of entries.

Measured transmitted packets per second using a single pktgen thread
with varying number of entries when the transmitted packet always hits
the default entry (worst case):

Number of entries | Improvement
------------------|------------
1k                | +1.12%
4k                | +9.22%
16k               | +55%
64k               | +585%
256k              | +2460%

The first patches are preparations for the conversion in the last patch.
Specifically, the series is structured as follows:

Patch #1 adds RCU read-side critical sections in the Tx path when
accessing FDB entries. Targeting net-next as I am not aware of any
issues due to this omission despite the code being structured that way
for a long time. Without it, traces will be generated when converting
FDB lookup to rhashtable_lookup().

Patches #2-#5 simplify the creation of the default FDB entry (all-zeroes).
Current code assumes that insertion into the hash table cannot fail,
which will no longer be true with rhashtable.

Patches #6-#10 add FDB entries to a linked list for entry traversal
instead of traversing over them using the fixed size hash table which is
removed in the last patch.

Patches #11-#12 add wrappers for FDB lookup that make it clear when each
should be used along with lockdep annotations. Needed as a preparation
for rhashtable_lookup() that must be called from an RCU read-side
critical section.

Patch #13 treats dst cache initialization errors as non-fatal. See more
info in the commit message. The current code happens to work because
insertion into the fixed size hash table is slow enough for the per-CPU
allocator to be able to create new chunks of per-CPU memory.

Patch #14 adds an FDB key structure that includes the MAC address and
source VNI. To be used as rhashtable key.

Patch #15 does the conversion to rhashtable.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>