Skip to content

[test] bpf-next_test #11

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

[test] bpf-next_test #11

wants to merge 1 commit into from

Conversation

kernel-patches-bot
Copy link

branch: bpf-next_test
base:bpf-next
version: edc21dc

@kernel-patches-bot
Copy link
Author

branch: bpf-next_test
base:bpf-next
version: edc21dc

@kernel-patches-bot
Copy link
Author

branch: bpf-next_test
base:bpf-next
version: edc21dc

@kernel-patches-bot
Copy link
Author

branch: bpf-next_test
base:bpf-next
version: edc21dc

@kernel-patches-bot
Copy link
Author

branch: bpf-next_test
base:bpf-next
version: edc21dc

kernel-patches-bot pushed a commit that referenced this pull request Feb 18, 2022
When bringing down the netdevice or system shutdown, a panic can be
triggered while accessing the sysfs path because the device is already
removed.

    [  755.549084] mlx5_core 0000:12:00.1: Shutdown was called
    [  756.404455] mlx5_core 0000:12:00.0: Shutdown was called
    ...
    [  757.937260] BUG: unable to handle kernel NULL pointer dereference at           (null)
    [  758.031397] IP: [<ffffffff8ee11acb>] dma_pool_alloc+0x1ab/0x280

    crash> bt
    ...
    PID: 12649  TASK: ffff8924108f2100  CPU: 1   COMMAND: "amsd"
    ...
     #9 [ffff89240e1a38b0] page_fault at ffffffff8f38c778
        [exception RIP: dma_pool_alloc+0x1ab]
        RIP: ffffffff8ee11acb  RSP: ffff89240e1a3968  RFLAGS: 00010046
        RAX: 0000000000000246  RBX: ffff89243d874100  RCX: 0000000000001000
        RDX: 0000000000000000  RSI: 0000000000000246  RDI: ffff89243d874090
        RBP: ffff89240e1a39c0   R8: 000000000001f080   R9: ffff8905ffc03c00
        R10: ffffffffc04680d4  R11: ffffffff8edde9fd  R12: 00000000000080d0
        R13: ffff89243d874090  R14: ffff89243d874080  R15: 0000000000000000
        ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    #10 [ffff89240e1a39c8] mlx5_alloc_cmd_msg at ffffffffc04680f3 [mlx5_core]
    #11 [ffff89240e1a3a18] cmd_exec at ffffffffc046ad62 [mlx5_core]
    #12 [ffff89240e1a3ab8] mlx5_cmd_exec at ffffffffc046b4fb [mlx5_core]
    #13 [ffff89240e1a3ae8] mlx5_core_access_reg at ffffffffc0475434 [mlx5_core]
    #14 [ffff89240e1a3b40] mlx5e_get_fec_caps at ffffffffc04a7348 [mlx5_core]
    #15 [ffff89240e1a3bb0] get_fec_supported_advertised at ffffffffc04992bf [mlx5_core]
    #16 [ffff89240e1a3c08] mlx5e_get_link_ksettings at ffffffffc049ab36 [mlx5_core]
    #17 [ffff89240e1a3ce8] __ethtool_get_link_ksettings at ffffffff8f25db46
    #18 [ffff89240e1a3d48] speed_show at ffffffff8f277208
    #19 [ffff89240e1a3dd8] dev_attr_show at ffffffff8f0b70e3
    #20 [ffff89240e1a3df8] sysfs_kf_seq_show at ffffffff8eedbedf
    #21 [ffff89240e1a3e18] kernfs_seq_show at ffffffff8eeda596
    #22 [ffff89240e1a3e28] seq_read at ffffffff8ee76d10
    #23 [ffff89240e1a3e98] kernfs_fop_read at ffffffff8eedaef5
    #24 [ffff89240e1a3ed8] vfs_read at ffffffff8ee4e3ff
    #25 [ffff89240e1a3f08] sys_read at ffffffff8ee4f27f
    #26 [ffff89240e1a3f50] system_call_fastpath at ffffffff8f395f92

    crash> net_device.state ffff89443b0c0000
      state = 0x5  (__LINK_STATE_START| __LINK_STATE_NOCARRIER)

To prevent this scenario, we also make sure that the netdevice is present.

Signed-off-by: suresh kumar <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Mar 5, 2022
Ido Schimmel says:

====================
mlxsw: Various updates

This patchset contains miscellaneous updates to mlxsw gathered over
time.

Patches #1-#2 fix recent regressions present in net-next.

Patches #3-#11 are small cleanups performed while adding line card
support in mlxsw.

Patch #12 adds the SFF-8024 Identifier Value of OSFP transceiver in
order to be able to dump their EEPROM contents over the ethtool IOCTL
interface.
====================

Signed-off-by: David S. Miller <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Mar 5, 2022
Ido Schimmel says:

====================
HW counters for soft devices

Petr says:

Offloading switch device drivers may be able to collect statistics of the
traffic taking place in the HW datapath that pertains to a certain soft
netdevice, such as a VLAN. In this patch set, add the necessary
infrastructure to allow exposing these statistics to the offloaded
netdevice in question, and add mlxsw offload.

Across HW platforms, the counter itself very likely constitutes a limited
resource, and the act of counting may have a performance impact. Therefore
this patch set makes the HW statistics collection opt-in and togglable from
userspace on a per-netdevice basis.

Additionally, HW devices may have various limiting conditions under which
they can realize the counter. Therefore it is also possible to query
whether the requested counter is realized by any driver. In TC parlance,
which is to a degree reused in this patch set, two values are recognized:
"request" tracks whether the user enabled collecting HW statistics, and
"used" tracks whether any HW statistics are actually collected.

In the past, this author has expressed the opinion that `a typical user
doing "ip -s l sh", including various scripts, wants to see the full
picture and not worry what's going on where'. While that would be nice,
unfortunately it cannot work:

- Packets that trap from the HW datapath to the SW datapath would be
  double counted.

  For a given netdevice, some traffic can be purely a SW artifact, and some
  may flow through the HW object corresponding to the netdevice. But some
  traffic can also get trapped to the SW datapath after bumping the HW
  counter. It is not clear how to make sure double-counting does not occur
  in the SW datapath in that case, while still making sure that possibly
  divergent SW forwarding path gets bumped as appropriate.

  So simply adding HW and SW stats may work roughly, most of the time, but
  there are scenarios where the result is nonsensical.

- HW devices will have limitations as to what type of traffic they can
  count.

  In case of mlxsw, which is part of this patch set, there is no reasonable
  way to count all traffic going through a certain netdevice, such as a
  VLAN netdevice enslaved to a bridge. It is however very simple to count
  traffic flowing through an L3 object, such as a VLAN netdevice with an IP
  address.

  Similarly for physical netdevices, the L3 object at which the counter is
  installed is the subport carrying untagged traffic.

  These are not "just counters". It is important that the user understands
  what is being counted. It would be incorrect to conflate these statistics
  with another existing statistics suite.

To that end, this patch set introduces a statistics suite called "L3
stats". This label should make it easy to understand what is being counted,
and to decide whether a given device can or cannot implement this suite for
some type of netdevice. At the same time, the code is written to make
future extensions easy, should a device pop up that can implement a
different flavor of statistics suite (say L2, or an address-family-specific
suite).

For example, using a work-in-progress iproute2[1], to turn on and then list
the counters on a VLAN netdevice:

    # ip stats set dev swp1.200 l3_stats on
    # ip stats show dev swp1.200 group offload subgroup l3_stats
    56: swp1.200: group offload subgroup l3_stats on used on
	RX:  bytes packets errors dropped  missed   mcast
		0       0      0       0       0       0
	TX:  bytes packets errors dropped carrier collsns
		0       0      0       0       0       0

The patchset progresses as follows:

- Patch #1 is a cleanup.

- In patch #2, remove the assumption that all LINK_OFFLOAD_XSTATS are
  dev-backed.

  The only attribute defined under the nest is currently
  IFLA_OFFLOAD_XSTATS_CPU_HIT. L3_STATS differs from CPU_HIT in that the
  driver that supplies the statistics is not the same as the driver that
  implements the netdevice. Make the code compatible with this in patch #2.

- In patch #3, add the possibility to filter inside nests.

  The filter_mask field of RTM_GETSTATS header determines which
  top-level attributes should be included in the netlink response. This
  saves processing time by only including the bits that the user cares
  about instead of always dumping everything. This is doubly important
  for HW-backed statistics that would typically require a trip to the
  device to fetch the stats. In this patch, the UAPI is extended to
  allow filtering inside IFLA_STATS_LINK_OFFLOAD_XSTATS in particular,
  but the scheme is easily extensible to other nests as well.

- In patch #4, propagate extack where we need it.
  In patch #5, make it possible to propagate errors from drivers to the
  user.

- In patch #6, add the in-kernel APIs for keeping track of the new stats
  suite, and the notifiers that the core uses to communicate with the
  drivers.

- In patch #7, add UAPI for obtaining the new stats suite.

- In patch #8, add a new UAPI message, RTM_SETSTATS, which will carry
  the message to toggle the newly-added stats suite.
  In patch #9, add the toggle itself.

At this point the core is ready for drivers to add support for the new
stats suite.

- In patches #10, #11 and #12, apply small tweaks to mlxsw code.

- In patch #13, add support for L3 stats, which are realized as RIF
  counters.

- Finally in patch #14, a selftest is added to the net/forwarding
  directory. Technically this is a HW-specific test, in that without a HW
  implementing the counters, it just will not pass. But devices that
  support L3 statistics at all are likely to be able to reuse this
  selftest, so it seems appropriate to put it in the general forwarding
  directory.

We also have a netdevsim implementation, and a corresponding selftest that
verifies specifically some of the core code. We intend to contribute these
later. Interested parties can take a look at the raw code at [2].

[1] https://github.com/pmachata/iproute2/commits/soft_counters
[2] https://github.com/pmachata/linux_mlxsw/commits/petrm_soft_counters_2

v2:
- Patch #3:
    - Do not declare strict_start_type at the new policies, since they are
      used with nla_parse_nested() (sans _deprecated).
    - Use NLA_POLICY_NESTED to declare what the nest contents should be
    - Use NLA_POLICY_MASK instead of BITFIELD32 for the filtering
      attribute.
- Patch #6:
    - s/monotonous/monotonic/ in commit message
    - Use a newly-added struct rtnl_hw_stats64 for stats transfer
- Patch #7:
    - Use a newly-added struct rtnl_hw_stats64 for stats transfer
- Patch #8:
    - Do not declare strict_start_type at the new policies, since they are
      used with nla_parse_nested() (sans _deprecated).
- Patch #13:
    - Use a newly-added struct rtnl_hw_stats64 for stats transfer
====================

Signed-off-by: David S. Miller <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Mar 29, 2022
In remove_phb_dynamic() we use &phb->io_resource, after we've called
device_unregister(&host_bridge->dev). But the unregister may have freed
phb, because pcibios_free_controller_deferred() is the release function
for the host_bridge.

If there are no outstanding references when we call device_unregister()
then phb will be freed out from under us.

This has gone mainly unnoticed, but with slub_debug and page_poison
enabled it can lead to a crash:

  PID: 7574   TASK: c0000000d492cb80  CPU: 13  COMMAND: "drmgr"
   #0 [c0000000e4f075a0] crash_kexec at c00000000027d7dc
   #1 [c0000000e4f075d0] oops_end at c000000000029608
   #2 [c0000000e4f07650] __bad_page_fault at c0000000000904b4
   #3 [c0000000e4f076c0] do_bad_slb_fault at c00000000009a5a8
   #4 [c0000000e4f076f0] data_access_slb_common_virt at c000000000008b30
   Data SLB Access [380] exception frame:
   R0:  c000000000167250    R1:  c0000000e4f07a00    R2:  c000000002a46100
   R3:  c000000002b39ce8    R4:  00000000000000c0    R5:  00000000000000a9
   R6:  3894674d000000c0    R7:  0000000000000000    R8:  00000000000000ff
   R9:  0000000000000100    R10: 6b6b6b6b6b6b6b6b    R11: 0000000000008000
   R12: c00000000023da80    R13: c0000009ffd38b00    R14: 0000000000000000
   R15: 000000011c87f0f0    R16: 0000000000000006    R17: 0000000000000003
   R18: 0000000000000002    R19: 0000000000000004    R20: 0000000000000005
   R21: 000000011c87ede8    R22: 000000011c87c5a8    R23: 000000011c87d3a0
   R24: 0000000000000000    R25: 0000000000000001    R26: c0000000e4f07cc8
   R27: c00000004d1cc400    R28: c0080000031d00e8    R29: c00000004d23d800
   R30: c00000004d1d2400    R31: c00000004d1d2540
   NIP: c000000000167258    MSR: 8000000000009033    OR3: c000000000e9f474
   CTR: 0000000000000000    LR:  c000000000167250    XER: 0000000020040003
   CCR: 0000000024088420    MQ:  0000000000000000    DAR: 6b6b6b6b6b6b6ba3
   DSISR: c0000000e4f07920     Syscall Result: fffffffffffffff2
   [NIP  : release_resource+56]
   [LR   : release_resource+48]
   #5 [c0000000e4f07a00] release_resource at c000000000167258  (unreliable)
   #6 [c0000000e4f07a30] remove_phb_dynamic at c000000000105648
   #7 [c0000000e4f07ab0] dlpar_remove_slot at c0080000031a09e8 [rpadlpar_io]
   #8 [c0000000e4f07b50] remove_slot_store at c0080000031a0b9c [rpadlpar_io]
   #9 [c0000000e4f07be0] kobj_attr_store at c000000000817d8c
  #10 [c0000000e4f07c00] sysfs_kf_write at c00000000063e504
  #11 [c0000000e4f07c20] kernfs_fop_write_iter at c00000000063d868
  #12 [c0000000e4f07c70] new_sync_write at c00000000054339c
  #13 [c0000000e4f07d10] vfs_write at c000000000546624
  #14 [c0000000e4f07d60] ksys_write at c0000000005469f4
  #15 [c0000000e4f07db0] system_call_exception at c000000000030840
  #16 [c0000000e4f07e10] system_call_vectored_common at c00000000000c168

To avoid it, we can take a reference to the host_bridge->dev until we're
done using phb. Then when we drop the reference the phb will be freed.

Fixes: 2dd9c11 ("powerpc/pseries: use pci_host_bridge.release_fn() to kfree(phb)")
Reported-by: David Dai <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Tested-by: Sachin Sant <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
kernel-patches-bot pushed a commit that referenced this pull request Apr 9, 2022
Ido Schimmel says:

====================
net/sched: Better error reporting for offload failures

This patchset improves error reporting to user space when offload fails
during the flow action setup phase. That is, when failures occur in the
actions themselves, even before calling device drivers. Requested /
reported in [1].

This is done by passing extack to the offload_act_setup() callback and
making use of it in the various actions.

Patches #1-#2 change matchall and flower to log error messages to user
space in accordance with the verbose flag.

Patch #3 passes extack to the offload_act_setup() callback from the
various call sites, including matchall and flower.

Patches #4-#11 make use of extack in the various actions to report
offload failures.

Patch #12 adds an error message when the action does not support offload
at all.

Patches #13-#14 change matchall and flower to stop overwriting more
specific error messages.

[1] https://lore.kernel.org/netdev/20220317185249.5mff5u2x624pjewv@skbuf/
====================

Signed-off-by: David S. Miller <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request May 12, 2022
Kernel panic when injecting memory_failure for the global
huge_zero_page, when CONFIG_DEBUG_VM is enabled, as follows.

  Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000
  page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00
  head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0
  flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff)
  raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000
  raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000
  page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head))
  ------------[ cut here ]------------
  kernel BUG at mm/huge_memory.c:2499!
  invalid opcode: 0000 [#1] PREEMPT SMP PTI
  CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11
  Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
  RIP: 0010:split_huge_page_to_list+0x66a/0x880
  Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b
  RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246
  RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff
  RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff
  R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000
  R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40
  FS:  00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   try_to_split_thp_page+0x3a/0x130
   memory_failure+0x128/0x800
   madvise_inject_error.cold+0x8b/0xa1
   __x64_sys_madvise+0x54/0x60
   do_syscall_64+0x35/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xae
  RIP: 0033:0x7fc3754f8bf9
  Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8
  RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
  RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9
  RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000
  RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000
  R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490
  R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000

This makes huge_zero_page bail out explicitly before split in
memory_failure(), thus the panic above won't happen again.

Link: https://lkml.kernel.org/r/497d3835612610e370c74e697ea3c721d1d55b9c.1649775850.git.xuyu@linux.alibaba.com
Fixes: 6a46079 ("HWPOISON: The high level memory error handler in the VM v7")
Signed-off-by: Xu Yu <[email protected]>
Reported-by: Abaci <[email protected]>
Suggested-by: Naoya Horiguchi <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request May 23, 2022
Kernel panic when injecting memory_failure for the global huge_zero_page,
when CONFIG_DEBUG_VM is enabled, as follows.

  Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000
  page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00
  head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0
  flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff)
  raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000
  raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000
  page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head))
  ------------[ cut here ]------------
  kernel BUG at mm/huge_memory.c:2499!
  invalid opcode: 0000 [#1] PREEMPT SMP PTI
  CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11
  Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
  RIP: 0010:split_huge_page_to_list+0x66a/0x880
  Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b
  RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246
  RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff
  RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff
  R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000
  R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40
  FS:  00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
  try_to_split_thp_page+0x3a/0x130
  memory_failure+0x128/0x800
  madvise_inject_error.cold+0x8b/0xa1
  __x64_sys_madvise+0x54/0x60
  do_syscall_64+0x35/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xae
  RIP: 0033:0x7fc3754f8bf9
  Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8
  RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
  RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9
  RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000
  RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000
  R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490
  R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000

We think that raising BUG is overkilling for splitting huge_zero_page, the
huge_zero_page can't be met from normal paths other than memory failure,
but memory failure is a valid caller.  So we tend to replace the BUG to
WARN + returning -EBUSY, and thus the panic above won't happen again.

Link: https://lkml.kernel.org/r/f35f8b97377d5d3ede1bc5ac3114da888c57cbce.1651052574.git.xuyu@linux.alibaba.com
Fixes: d173d54 ("mm/memory-failure.c: skip huge_zero_page in memory_failure()")
Fixes: 6a46079 ("HWPOISON: The high level memory error handler in the VM v7")
Signed-off-by: Xu Yu <[email protected]>
Suggested-by: Yang Shi <[email protected]>
Reported-by: kernel test robot <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request May 23, 2022
Do not allow to write timestamps on RX rings if PF is being configured.
When PF is being configured RX rings can be freed or rebuilt. If at the
same time timestamps are updated, the kernel will crash by dereferencing
null RX ring pointer.

PID: 1449   TASK: ff187d28ed658040  CPU: 34  COMMAND: "ice-ptp-0000:51"
 #0 [ff1966a94a713bb0] machine_kexec at ffffffff9d05a0be
 #1 [ff1966a94a713c08] __crash_kexec at ffffffff9d192e9d
 #2 [ff1966a94a713cd0] crash_kexec at ffffffff9d1941bd
 #3 [ff1966a94a713ce8] oops_end at ffffffff9d01bd54
 #4 [ff1966a94a713d08] no_context at ffffffff9d06bda4
 #5 [ff1966a94a713d60] __bad_area_nosemaphore at ffffffff9d06c10c
 #6 [ff1966a94a713da8] do_page_fault at ffffffff9d06cae4
 #7 [ff1966a94a713de0] page_fault at ffffffff9da0107e
    [exception RIP: ice_ptp_update_cached_phctime+91]
    RIP: ffffffffc076db8b  RSP: ff1966a94a713e98  RFLAGS: 00010246
    RAX: 16e3db9c6b7ccae4  RBX: ff187d269dd3c180  RCX: ff187d269cd4d018
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: ff187d269cfcc644   R8: ff187d339b9641b0   R9: 0000000000000000
    R10: 0000000000000002  R11: 0000000000000000  R12: ff187d269cfcc648
    R13: ffffffff9f128784  R14: ffffffff9d101b70  R15: ff187d269cfcc640
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ff1966a94a713ea0] ice_ptp_periodic_work at ffffffffc076dbef [ice]
 #9 [ff1966a94a713ee0] kthread_worker_fn at ffffffff9d101c1b
 #10 [ff1966a94a713f10] kthread at ffffffff9d101b4d
 #11 [ff1966a94a713f50] ret_from_fork at ffffffff9da0023f

Fixes: 77a7811 ("ice: enable receive hardware timestamping")
Signed-off-by: Arkadiusz Kubalewski <[email protected]>
Reviewed-by: Michal Schmidt <[email protected]>
Tested-by: Dave Cain <[email protected]>
Tested-by: Gurucharan <[email protected]> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Jun 14, 2022
OFFLOADS paring using devcom is possible only on devices
that support LAG. Filter based on lag capabilities.

This fixes an issue where mlx5_get_next_phys_dev() was
called without holding the interface lock.

This issue was found when commit
bc4c2f2 ("net/mlx5: Lag, filter non compatible devices")
added an assert that verifies the interface lock is held.

WARNING: CPU: 9 PID: 1706 at drivers/net/ethernet/mellanox/mlx5/core/dev.c:642 mlx5_get_next_phys_dev+0xd2/0x100 [mlx5_core]
Modules linked in: mlx5_vdpa vringh vhost_iotlb vdpa mlx5_ib mlx5_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_umad ib_ipoib ib_cm ib_uverbs ib_core overlay fuse [last unloaded: mlx5_core]
CPU: 9 PID: 1706 Comm: devlink Not tainted 5.18.0-rc7+ #11
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:mlx5_get_next_phys_dev+0xd2/0x100 [mlx5_core]
Code: 02 00 75 48 48 8b 85 80 04 00 00 5d c3 31 c0 5d c3 be ff ff ff ff 48 c7 c7 08 41 5b a0 e8 36 87 28 e3 85 c0 0f 85 6f ff ff ff <0f> 0b e9 68 ff ff ff 48 c7 c7 0c 91 cc 84 e8 cb 36 6f e1 e9 4d ff
RSP: 0018:ffff88811bf47458 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88811b398000 RCX: 0000000000000001
RDX: 0000000080000000 RSI: ffffffffa05b4108 RDI: ffff88812daaaa78
RBP: ffff88812d050380 R08: 0000000000000001 R09: ffff88811d6b3437
R10: 0000000000000001 R11: 00000000fddd3581 R12: ffff88815238c000
R13: ffff88812d050380 R14: ffff8881018aa7e0 R15: ffff88811d6b3428
FS:  00007fc82e18ae80(0000) GS:ffff88842e080000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f9630d1b421 CR3: 0000000149802004 CR4: 0000000000370ea0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 mlx5_esw_offloads_devcom_event+0x99/0x3b0 [mlx5_core]
 mlx5_devcom_send_event+0x167/0x1d0 [mlx5_core]
 esw_offloads_enable+0x1153/0x1500 [mlx5_core]
 ? mlx5_esw_offloads_controller_valid+0x170/0x170 [mlx5_core]
 ? wait_for_completion_io_timeout+0x20/0x20
 ? mlx5_rescan_drivers_locked+0x318/0x810 [mlx5_core]
 mlx5_eswitch_enable_locked+0x586/0xc50 [mlx5_core]
 ? mlx5_eswitch_disable_pf_vf_vports+0x1d0/0x1d0 [mlx5_core]
 ? mlx5_esw_try_lock+0x1b/0xb0 [mlx5_core]
 ? mlx5_eswitch_enable+0x270/0x270 [mlx5_core]
 ? __debugfs_create_file+0x260/0x3e0
 mlx5_devlink_eswitch_mode_set+0x27e/0x870 [mlx5_core]
 ? mutex_lock_io_nested+0x12c0/0x12c0
 ? esw_offloads_disable+0x250/0x250 [mlx5_core]
 ? devlink_nl_cmd_trap_get_dumpit+0x470/0x470
 ? rcu_read_lock_sched_held+0x3f/0x70
 devlink_nl_cmd_eswitch_set_doit+0x217/0x620

Fixes: dd3fddb ("net/mlx5: E-Switch, handle devcom events only for ports on the same device")
Signed-off-by: Mark Bloch <[email protected]>
Reviewed-by: Roi Dayan <[email protected]>
Reviewed-by: Moshe Shemesh <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Jun 18, 2022
Ido Schimmel says:

====================
mlxsw: L3 HW stats improvements

While testing L3 HW stats [1] on top of mlxsw, two issues were found:

1. Stats cannot be enabled for more than 205 netdevs. This was fixed in
commit 4b7a632 ("mlxsw: spectrum_cnt: Reorder counter pools").

2. ARP packets are counted as errors. Patch #1 takes care of that. See
the commit message for details.

The goal of the majority of the rest of the patches is to add selftests
that would have discovered that only about 205 netdevs can have L3 HW
stats supported, despite the HW supporting much more. The obvious place
to plug this in is the scale test framework.

The scale tests are currently testing two things: that some number of
instances of a given resource can actually be created; and that when an
attempt is made to create more than the supported amount, the failures
are noted and handled gracefully.

However the ability to allocate the resource does not mean that the
resource actually works when passing traffic. For that, make it possible
for a given scale to also test traffic.

To that end, this patchset adds traffic tests. The goal of these is to
run traffic and observe whether a sample of the allocated resource
instances actually perform their task. Traffic tests are only run on the
positive leg of the scale test (no point trying to pass traffic when the
expected outcome is that the resource will not be allocated). They are
opt-in, if a given test does not expose it, it is not run.

The patchset proceeds as follows:

- Patches #2 and #3 add to "devlink resource" support for number of
  allocated RIFs, and the capacity. This is necessary, because when
  evaluating how many L3 HW stats instances it should be possible to
  allocate, the limiting resource on Spectrum-2 and above currently is
  not the counters themselves, but actually the RIFs.

- Patch #6 adds support for invocation of a traffic test, if a given scale
  tests exposes it.

- Patch #7 adds support for skipping a given scale test. Because on
  Spectrum-2 and above, the limiting factor to L3 HW stats instances is
  actually the number of RIFs, there is no point in running the failing leg
  of a scale tests, because it would test exhaustion of RIFs, not of RIF
  counters.

- With patch #8, the scale tests drivers pass the target number to the
  cleanup function of a scale test.

- In patch #9, add a traffic test to the tc_flower selftests. This makes
  sure that the flow counters installed with the ACLs actually do count as
  they are supposed to.

- In patch #10, add a new scale selftest for RIF counter scale, including a
  traffic test.

- In patch #11, the scale target for the tc_flower selftest is
  dynamically set instead of being hard coded.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca0a53dcec9495d1dc5bbc369c810c520d728373
====================

Signed-off-by: David S. Miller <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Jul 10, 2022
Ido Schimmel says:

====================
mlxsw: Unified bridge conversion - part 2/6

This is the second part of the conversion of mlxsw to the unified bridge
model. Part 1 was merged in commit 4336487 ("Merge branch
'mlxsw-unified-bridge-conversion-part-1'") which includes details about
the new model and the motivation behind the conversion.

This patchset does not begin the conversion, but rather prepares the code
base for it.

Patchset overview:

Patch #1 removes an unnecessary field from one of the FID families.

Patches #2-#7 make various improvements in the layer 2 multicast code,
making it more receptive towards upcoming changes.

Patches #8-#10 prepare the CONFIG_PROFILE command for the unified bridge
model. This command will be used to enable the new model in the last
patchset.

Patches #11-#13 perform small changes in the FID code, preparing it for
upcoming changes.
====================

Signed-off-by: David S. Miller <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Jul 10, 2022
Ido Schimmel says:

====================
mlxsw: Unified bridge conversion - part 4/6

This is the fourth part of the conversion of mlxsw to the unified bridge
model.

Unlike previous parts that prepared mlxsw for the conversion, this part
actually starts the conversion. It focuses on flooding configuration and
converts mlxsw to the more "raw" APIs of the unified bridge model.

The patches configure the different stages of the flooding pipeline in
Spectrum that looks as follows (at a high-level):

         +------------+                +----------+           +-------+
  {FID,  |            | {Packet type,  |          |           |       |  MID
   DMAC} | FDB lookup |  Bridge type}  |   SFGC   | MID base  |       | Index
+-------->   (miss)   +----------------> register +-----------> Adder +------->
         |            |                |          |           |       |
         |            |                |          |           |       |
         +------------+                +----+-----+           +---^---+
                                            |                     |
                                    Table   |                     |
                                     type   |                     | Offset
                                            |      +-------+      |
                                            |      |       |      |
                                            |      |       |      |
                                            +----->+  Mux  +------+
                                                   |       |
                                                   |       |
                                                   +-^---^-+
                                                     |   |
                                                  FID|   |FID
                                                     |   |offset
                                                     +   +

The multicast identifier (MID) index is used as an index to the port
group table (PGT) that contains a bitmap of ports via which a packet
needs to be replicated.

From the PGT table, the packet continues to the multicast port egress
(MPE) table that determines the packet's egress VLAN. This is a
two-dimensional table that is indexed by port and switch multicast port
to egress (SMPE) index. The latter can be thought of as a FID. Without
it, all the packets replicated via a certain port would get the same
VLAN, regardless of the bridge domain (FID).

Logically, these two steps look as follows:

                     PGT table                           MPE table
             +-----------------------+               +---------------+
             |                       | {Local port,  |               | Egress
  MID index  | Local ports bitmap #1 |  SMPE index}  |               |  VID
+------------>        ...            +--------------->               +-------->
             | Local ports bitmap #N |               |               |
             |                       |          SMPE |               |
             +-----------------------+               +---------------+
                                                        Local port

Patchset overview:

Patch #1 adds a variable to guard against mixed model configuration.
Will be removed in part 6 when mlxsw is fully converted to the unified
model.

Patches #2-#5 introduce two new FID attributes required for flooding
configuration in the new model:

1. 'flood_rsp': Instructs the firmware to handle flooding configuration
for this FID. Only set for router FIDs (rFIDs) which are used to connect
a {Port, VLAN} to the router block.

2. 'bridge_type': Allows the device to determine the flood table (i.e.,
base index to the PGT table) for the FID. The first type will be used
for FIDs in a VLAN-aware bridge and the second for FIDs representing
VLAN-unaware bridges.

Patch #6 configures the MPE table that determines the egress VLAN of a
packet that is forwarded according to L2 multicast / flood.

Patches #7-#11 add the PGT table and related APIs to allocate entries
and set / clear ports in them.

Patches #12-#13 convert the flooding configuration to use the new PGT
APIs.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Jul 10, 2022
Ido Schimmel says:

====================
mlxsw: Unified bridge conversion - part 6/6

This is the sixth and final part of the conversion of mlxsw to the
unified bridge model. It transitions the last bits of functionality that
were under firmware's responsibility in the legacy model to the driver.
The last patches flip the driver to the unified bridge model and clean
up code that was used to make the conversion easier to review.

Patchset overview:

Patch #1 sets the egress VID for known unicast packets. For multicast
packets, the egress VID is configured using the MPE table. See commit
8c2da08 ("mlxsw: spectrum_fid: Configure egress VID classification
for multicast").

Patch #2 configures the VNI to FID classification that is used during
decapsulation.

Patch #3 configures ingress router interface (RIF) in FID classification
records, so that when a packet reaches the router block, its ingress RIF
is known. Care is taken to configure this in all the different flows
(e.g., RIF set on a FID, {Port, VID} joins a FID that already has a RIF
etc.).

Patch #4 configures the egress VID for routed packets. For such packets,
the egress VID is not set by the MPE table or by an FDB record at the
egress bridge, but instead by a dedicated table that maps {Egress RIF,
Egress port} to a VID.

Patch #5 removes VID configuration from RIF creation as in the unified
bridge model firmware no longer needs it.

Patch #6 sets the egress FID to use in RIF configuration so that the
device knows using which FID to bridge the packet after routing.

Patches #7-#9 add a new 802.1Q family and associated VLAN RIFs. In the
unified bridge model, we no longer need to emulate 802.1Q FIDs using
802.1D FIDs as VNI can be associated with both.

Patches #10-#11 finally flip the driver to the unified bridge model.

Patches #12-#13 clean up code that was used to make the conversion
easier to review.

v2:
* Fix build failure [1] in patch #1.

[1] https://lore.kernel.org/netdev/[email protected]/
====================

Signed-off-by: David S. Miller <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Aug 26, 2022
With special lengths supplied by user space, register_shm_helper() has
an integer overflow when calculating the number of pages covered by a
supplied user space memory region.

This causes internal_get_user_pages_fast() a helper function of
pin_user_pages_fast() to do a NULL pointer dereference:

  Unable to handle kernel NULL pointer dereference at virtual address 0000000000000010
  Modules linked in:
  CPU: 1 PID: 173 Comm: optee_example_a Not tainted 5.19.0 #11
  Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
  pc : internal_get_user_pages_fast+0x474/0xa80
  Call trace:
   internal_get_user_pages_fast+0x474/0xa80
   pin_user_pages_fast+0x24/0x4c
   register_shm_helper+0x194/0x330
   tee_shm_register_user_buf+0x78/0x120
   tee_ioctl+0xd0/0x11a0
   __arm64_sys_ioctl+0xa8/0xec
   invoke_syscall+0x48/0x114

Fix this by adding an an explicit call to access_ok() in
tee_shm_register_user_buf() to catch an invalid user space address
early.

Fixes: 033ddf1 ("tee: add register user memory")
Cc: [email protected]
Reported-by: Nimish Mishra <[email protected]>
Reported-by: Anirban Chakraborty <[email protected]>
Reported-by: Debdeep Mukhopadhyay <[email protected]>
Suggested-by: Jerome Forissier <[email protected]>
Signed-off-by: Jens Wiklander <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Oct 10, 2022
ASAN reports an use-after-free in btf_dump_name_dups:

ERROR: AddressSanitizer: heap-use-after-free on address 0xffff927006db at pc 0xaaaab5dfb618 bp 0xffffdd89b890 sp 0xffffdd89b928
READ of size 2 at 0xffff927006db thread T0
    #0 0xaaaab5dfb614 in __interceptor_strcmp.part.0 (test_progs+0x21b614)
    #1 0xaaaab635f144 in str_equal_fn tools/lib/bpf/btf_dump.c:127
    #2 0xaaaab635e3e0 in hashmap_find_entry tools/lib/bpf/hashmap.c:143
    #3 0xaaaab635e72c in hashmap__find tools/lib/bpf/hashmap.c:212
    #4 0xaaaab6362258 in btf_dump_name_dups tools/lib/bpf/btf_dump.c:1525
    #5 0xaaaab636240c in btf_dump_resolve_name tools/lib/bpf/btf_dump.c:1552
    #6 0xaaaab6362598 in btf_dump_type_name tools/lib/bpf/btf_dump.c:1567
    #7 0xaaaab6360b48 in btf_dump_emit_struct_def tools/lib/bpf/btf_dump.c:912
    #8 0xaaaab6360630 in btf_dump_emit_type tools/lib/bpf/btf_dump.c:798
    #9 0xaaaab635f720 in btf_dump__dump_type tools/lib/bpf/btf_dump.c:282
    #10 0xaaaab608523c in test_btf_dump_incremental tools/testing/selftests/bpf/prog_tests/btf_dump.c:236
    #11 0xaaaab6097530 in test_btf_dump tools/testing/selftests/bpf/prog_tests/btf_dump.c:875
    #12 0xaaaab6314ed0 in run_one_test tools/testing/selftests/bpf/test_progs.c:1062
    #13 0xaaaab631a0a8 in main tools/testing/selftests/bpf/test_progs.c:1697
    #14 0xffff9676d214 in __libc_start_main ../csu/libc-start.c:308
    #15 0xaaaab5d65990  (test_progs+0x185990)

0xffff927006db is located 11 bytes inside of 16-byte region [0xffff927006d0,0xffff927006e0)
freed by thread T0 here:
    #0 0xaaaab5e2c7c4 in realloc (test_progs+0x24c7c4)
    #1 0xaaaab634f4a0 in libbpf_reallocarray tools/lib/bpf/libbpf_internal.h:191
    #2 0xaaaab634f840 in libbpf_add_mem tools/lib/bpf/btf.c:163
    #3 0xaaaab636643c in strset_add_str_mem tools/lib/bpf/strset.c:106
    #4 0xaaaab6366560 in strset__add_str tools/lib/bpf/strset.c:157
    #5 0xaaaab6352d70 in btf__add_str tools/lib/bpf/btf.c:1519
    #6 0xaaaab6353e10 in btf__add_field tools/lib/bpf/btf.c:2032
    #7 0xaaaab6084fcc in test_btf_dump_incremental tools/testing/selftests/bpf/prog_tests/btf_dump.c:232
    #8 0xaaaab6097530 in test_btf_dump tools/testing/selftests/bpf/prog_tests/btf_dump.c:875
    #9 0xaaaab6314ed0 in run_one_test tools/testing/selftests/bpf/test_progs.c:1062
    #10 0xaaaab631a0a8 in main tools/testing/selftests/bpf/test_progs.c:1697
    #11 0xffff9676d214 in __libc_start_main ../csu/libc-start.c:308
    #12 0xaaaab5d65990  (test_progs+0x185990)

previously allocated by thread T0 here:
    #0 0xaaaab5e2c7c4 in realloc (test_progs+0x24c7c4)
    #1 0xaaaab634f4a0 in libbpf_reallocarray tools/lib/bpf/libbpf_internal.h:191
    #2 0xaaaab634f840 in libbpf_add_mem tools/lib/bpf/btf.c:163
    #3 0xaaaab636643c in strset_add_str_mem tools/lib/bpf/strset.c:106
    #4 0xaaaab6366560 in strset__add_str tools/lib/bpf/strset.c:157
    #5 0xaaaab6352d70 in btf__add_str tools/lib/bpf/btf.c:1519
    #6 0xaaaab6353ff0 in btf_add_enum_common tools/lib/bpf/btf.c:2070
    #7 0xaaaab6354080 in btf__add_enum tools/lib/bpf/btf.c:2102
    #8 0xaaaab6082f50 in test_btf_dump_incremental tools/testing/selftests/bpf/prog_tests/btf_dump.c:162
    #9 0xaaaab6097530 in test_btf_dump tools/testing/selftests/bpf/prog_tests/btf_dump.c:875
    #10 0xaaaab6314ed0 in run_one_test tools/testing/selftests/bpf/test_progs.c:1062
    #11 0xaaaab631a0a8 in main tools/testing/selftests/bpf/test_progs.c:1697
    #12 0xffff9676d214 in __libc_start_main ../csu/libc-start.c:308
    #13 0xaaaab5d65990  (test_progs+0x185990)

The reason is that the key stored in hash table name_map is a string
address, and the string memory is allocated by realloc() function, when
the memory is resized by realloc() later, the old memory may be freed,
so the address stored in name_map references to a freed memory, causing
use-after-free.

Fix it by storing duplicated string address in name_map.

Fixes: 351131b ("libbpf: add btf_dump API for BTF-to-C conversion")
Signed-off-by: Xu Kuohai <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Oct 10, 2022
ASAN reports an use-after-free in btf_dump_name_dups:

ERROR: AddressSanitizer: heap-use-after-free on address 0xffff927006db at pc 0xaaaab5dfb618 bp 0xffffdd89b890 sp 0xffffdd89b928
READ of size 2 at 0xffff927006db thread T0
    #0 0xaaaab5dfb614 in __interceptor_strcmp.part.0 (test_progs+0x21b614)
    #1 0xaaaab635f144 in str_equal_fn tools/lib/bpf/btf_dump.c:127
    #2 0xaaaab635e3e0 in hashmap_find_entry tools/lib/bpf/hashmap.c:143
    #3 0xaaaab635e72c in hashmap__find tools/lib/bpf/hashmap.c:212
    #4 0xaaaab6362258 in btf_dump_name_dups tools/lib/bpf/btf_dump.c:1525
    #5 0xaaaab636240c in btf_dump_resolve_name tools/lib/bpf/btf_dump.c:1552
    #6 0xaaaab6362598 in btf_dump_type_name tools/lib/bpf/btf_dump.c:1567
    #7 0xaaaab6360b48 in btf_dump_emit_struct_def tools/lib/bpf/btf_dump.c:912
    #8 0xaaaab6360630 in btf_dump_emit_type tools/lib/bpf/btf_dump.c:798
    #9 0xaaaab635f720 in btf_dump__dump_type tools/lib/bpf/btf_dump.c:282
    #10 0xaaaab608523c in test_btf_dump_incremental tools/testing/selftests/bpf/prog_tests/btf_dump.c:236
    #11 0xaaaab6097530 in test_btf_dump tools/testing/selftests/bpf/prog_tests/btf_dump.c:875
    #12 0xaaaab6314ed0 in run_one_test tools/testing/selftests/bpf/test_progs.c:1062
    #13 0xaaaab631a0a8 in main tools/testing/selftests/bpf/test_progs.c:1697
    #14 0xffff9676d214 in __libc_start_main ../csu/libc-start.c:308
    #15 0xaaaab5d65990  (test_progs+0x185990)

0xffff927006db is located 11 bytes inside of 16-byte region [0xffff927006d0,0xffff927006e0)
freed by thread T0 here:
    #0 0xaaaab5e2c7c4 in realloc (test_progs+0x24c7c4)
    #1 0xaaaab634f4a0 in libbpf_reallocarray tools/lib/bpf/libbpf_internal.h:191
    #2 0xaaaab634f840 in libbpf_add_mem tools/lib/bpf/btf.c:163
    #3 0xaaaab636643c in strset_add_str_mem tools/lib/bpf/strset.c:106
    #4 0xaaaab6366560 in strset__add_str tools/lib/bpf/strset.c:157
    #5 0xaaaab6352d70 in btf__add_str tools/lib/bpf/btf.c:1519
    #6 0xaaaab6353e10 in btf__add_field tools/lib/bpf/btf.c:2032
    #7 0xaaaab6084fcc in test_btf_dump_incremental tools/testing/selftests/bpf/prog_tests/btf_dump.c:232
    #8 0xaaaab6097530 in test_btf_dump tools/testing/selftests/bpf/prog_tests/btf_dump.c:875
    #9 0xaaaab6314ed0 in run_one_test tools/testing/selftests/bpf/test_progs.c:1062
    #10 0xaaaab631a0a8 in main tools/testing/selftests/bpf/test_progs.c:1697
    #11 0xffff9676d214 in __libc_start_main ../csu/libc-start.c:308
    #12 0xaaaab5d65990  (test_progs+0x185990)

previously allocated by thread T0 here:
    #0 0xaaaab5e2c7c4 in realloc (test_progs+0x24c7c4)
    #1 0xaaaab634f4a0 in libbpf_reallocarray tools/lib/bpf/libbpf_internal.h:191
    #2 0xaaaab634f840 in libbpf_add_mem tools/lib/bpf/btf.c:163
    #3 0xaaaab636643c in strset_add_str_mem tools/lib/bpf/strset.c:106
    #4 0xaaaab6366560 in strset__add_str tools/lib/bpf/strset.c:157
    #5 0xaaaab6352d70 in btf__add_str tools/lib/bpf/btf.c:1519
    #6 0xaaaab6353ff0 in btf_add_enum_common tools/lib/bpf/btf.c:2070
    #7 0xaaaab6354080 in btf__add_enum tools/lib/bpf/btf.c:2102
    #8 0xaaaab6082f50 in test_btf_dump_incremental tools/testing/selftests/bpf/prog_tests/btf_dump.c:162
    #9 0xaaaab6097530 in test_btf_dump tools/testing/selftests/bpf/prog_tests/btf_dump.c:875
    #10 0xaaaab6314ed0 in run_one_test tools/testing/selftests/bpf/test_progs.c:1062
    #11 0xaaaab631a0a8 in main tools/testing/selftests/bpf/test_progs.c:1697
    #12 0xffff9676d214 in __libc_start_main ../csu/libc-start.c:308
    #13 0xaaaab5d65990  (test_progs+0x185990)

The reason is that the key stored in hash table name_map is a string
address, and the string memory is allocated by realloc() function, when
the memory is resized by realloc() later, the old memory may be freed,
so the address stored in name_map references to a freed memory, causing
use-after-free.

Fix it by storing duplicated string address in name_map.

Fixes: 351131b ("libbpf: add btf_dump API for BTF-to-C conversion")
Signed-off-by: Xu Kuohai <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Oct 11, 2022
ASAN reports an use-after-free in btf_dump_name_dups:

ERROR: AddressSanitizer: heap-use-after-free on address 0xffff927006db at pc 0xaaaab5dfb618 bp 0xffffdd89b890 sp 0xffffdd89b928
READ of size 2 at 0xffff927006db thread T0
    #0 0xaaaab5dfb614 in __interceptor_strcmp.part.0 (test_progs+0x21b614)
    #1 0xaaaab635f144 in str_equal_fn tools/lib/bpf/btf_dump.c:127
    #2 0xaaaab635e3e0 in hashmap_find_entry tools/lib/bpf/hashmap.c:143
    #3 0xaaaab635e72c in hashmap__find tools/lib/bpf/hashmap.c:212
    #4 0xaaaab6362258 in btf_dump_name_dups tools/lib/bpf/btf_dump.c:1525
    #5 0xaaaab636240c in btf_dump_resolve_name tools/lib/bpf/btf_dump.c:1552
    #6 0xaaaab6362598 in btf_dump_type_name tools/lib/bpf/btf_dump.c:1567
    #7 0xaaaab6360b48 in btf_dump_emit_struct_def tools/lib/bpf/btf_dump.c:912
    #8 0xaaaab6360630 in btf_dump_emit_type tools/lib/bpf/btf_dump.c:798
    #9 0xaaaab635f720 in btf_dump__dump_type tools/lib/bpf/btf_dump.c:282
    #10 0xaaaab608523c in test_btf_dump_incremental tools/testing/selftests/bpf/prog_tests/btf_dump.c:236
    #11 0xaaaab6097530 in test_btf_dump tools/testing/selftests/bpf/prog_tests/btf_dump.c:875
    #12 0xaaaab6314ed0 in run_one_test tools/testing/selftests/bpf/test_progs.c:1062
    #13 0xaaaab631a0a8 in main tools/testing/selftests/bpf/test_progs.c:1697
    #14 0xffff9676d214 in __libc_start_main ../csu/libc-start.c:308
    #15 0xaaaab5d65990  (test_progs+0x185990)

0xffff927006db is located 11 bytes inside of 16-byte region [0xffff927006d0,0xffff927006e0)
freed by thread T0 here:
    #0 0xaaaab5e2c7c4 in realloc (test_progs+0x24c7c4)
    #1 0xaaaab634f4a0 in libbpf_reallocarray tools/lib/bpf/libbpf_internal.h:191
    #2 0xaaaab634f840 in libbpf_add_mem tools/lib/bpf/btf.c:163
    #3 0xaaaab636643c in strset_add_str_mem tools/lib/bpf/strset.c:106
    #4 0xaaaab6366560 in strset__add_str tools/lib/bpf/strset.c:157
    #5 0xaaaab6352d70 in btf__add_str tools/lib/bpf/btf.c:1519
    #6 0xaaaab6353e10 in btf__add_field tools/lib/bpf/btf.c:2032
    #7 0xaaaab6084fcc in test_btf_dump_incremental tools/testing/selftests/bpf/prog_tests/btf_dump.c:232
    #8 0xaaaab6097530 in test_btf_dump tools/testing/selftests/bpf/prog_tests/btf_dump.c:875
    #9 0xaaaab6314ed0 in run_one_test tools/testing/selftests/bpf/test_progs.c:1062
    #10 0xaaaab631a0a8 in main tools/testing/selftests/bpf/test_progs.c:1697
    #11 0xffff9676d214 in __libc_start_main ../csu/libc-start.c:308
    #12 0xaaaab5d65990  (test_progs+0x185990)

previously allocated by thread T0 here:
    #0 0xaaaab5e2c7c4 in realloc (test_progs+0x24c7c4)
    #1 0xaaaab634f4a0 in libbpf_reallocarray tools/lib/bpf/libbpf_internal.h:191
    #2 0xaaaab634f840 in libbpf_add_mem tools/lib/bpf/btf.c:163
    #3 0xaaaab636643c in strset_add_str_mem tools/lib/bpf/strset.c:106
    #4 0xaaaab6366560 in strset__add_str tools/lib/bpf/strset.c:157
    #5 0xaaaab6352d70 in btf__add_str tools/lib/bpf/btf.c:1519
    #6 0xaaaab6353ff0 in btf_add_enum_common tools/lib/bpf/btf.c:2070
    #7 0xaaaab6354080 in btf__add_enum tools/lib/bpf/btf.c:2102
    #8 0xaaaab6082f50 in test_btf_dump_incremental tools/testing/selftests/bpf/prog_tests/btf_dump.c:162
    #9 0xaaaab6097530 in test_btf_dump tools/testing/selftests/bpf/prog_tests/btf_dump.c:875
    #10 0xaaaab6314ed0 in run_one_test tools/testing/selftests/bpf/test_progs.c:1062
    #11 0xaaaab631a0a8 in main tools/testing/selftests/bpf/test_progs.c:1697
    #12 0xffff9676d214 in __libc_start_main ../csu/libc-start.c:308
    #13 0xaaaab5d65990  (test_progs+0x185990)

The reason is that the key stored in hash table name_map is a string
address, and the string memory is allocated by realloc() function, when
the memory is resized by realloc() later, the old memory may be freed,
so the address stored in name_map references to a freed memory, causing
use-after-free.

Fix it by storing duplicated string address in name_map.

Fixes: 919d2b1 ("libbpf: Allow modification of BTF and add btf__add_str API")
Signed-off-by: Xu Kuohai <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Oct 11, 2022
ASAN reports an use-after-free in btf_dump_name_dups:

ERROR: AddressSanitizer: heap-use-after-free on address 0xffff927006db at pc 0xaaaab5dfb618 bp 0xffffdd89b890 sp 0xffffdd89b928
READ of size 2 at 0xffff927006db thread T0
    #0 0xaaaab5dfb614 in __interceptor_strcmp.part.0 (test_progs+0x21b614)
    #1 0xaaaab635f144 in str_equal_fn tools/lib/bpf/btf_dump.c:127
    #2 0xaaaab635e3e0 in hashmap_find_entry tools/lib/bpf/hashmap.c:143
    #3 0xaaaab635e72c in hashmap__find tools/lib/bpf/hashmap.c:212
    #4 0xaaaab6362258 in btf_dump_name_dups tools/lib/bpf/btf_dump.c:1525
    #5 0xaaaab636240c in btf_dump_resolve_name tools/lib/bpf/btf_dump.c:1552
    #6 0xaaaab6362598 in btf_dump_type_name tools/lib/bpf/btf_dump.c:1567
    #7 0xaaaab6360b48 in btf_dump_emit_struct_def tools/lib/bpf/btf_dump.c:912
    #8 0xaaaab6360630 in btf_dump_emit_type tools/lib/bpf/btf_dump.c:798
    #9 0xaaaab635f720 in btf_dump__dump_type tools/lib/bpf/btf_dump.c:282
    #10 0xaaaab608523c in test_btf_dump_incremental tools/testing/selftests/bpf/prog_tests/btf_dump.c:236
    #11 0xaaaab6097530 in test_btf_dump tools/testing/selftests/bpf/prog_tests/btf_dump.c:875
    #12 0xaaaab6314ed0 in run_one_test tools/testing/selftests/bpf/test_progs.c:1062
    #13 0xaaaab631a0a8 in main tools/testing/selftests/bpf/test_progs.c:1697
    #14 0xffff9676d214 in __libc_start_main ../csu/libc-start.c:308
    #15 0xaaaab5d65990  (test_progs+0x185990)

0xffff927006db is located 11 bytes inside of 16-byte region [0xffff927006d0,0xffff927006e0)
freed by thread T0 here:
    #0 0xaaaab5e2c7c4 in realloc (test_progs+0x24c7c4)
    #1 0xaaaab634f4a0 in libbpf_reallocarray tools/lib/bpf/libbpf_internal.h:191
    #2 0xaaaab634f840 in libbpf_add_mem tools/lib/bpf/btf.c:163
    #3 0xaaaab636643c in strset_add_str_mem tools/lib/bpf/strset.c:106
    #4 0xaaaab6366560 in strset__add_str tools/lib/bpf/strset.c:157
    #5 0xaaaab6352d70 in btf__add_str tools/lib/bpf/btf.c:1519
    #6 0xaaaab6353e10 in btf__add_field tools/lib/bpf/btf.c:2032
    #7 0xaaaab6084fcc in test_btf_dump_incremental tools/testing/selftests/bpf/prog_tests/btf_dump.c:232
    #8 0xaaaab6097530 in test_btf_dump tools/testing/selftests/bpf/prog_tests/btf_dump.c:875
    #9 0xaaaab6314ed0 in run_one_test tools/testing/selftests/bpf/test_progs.c:1062
    #10 0xaaaab631a0a8 in main tools/testing/selftests/bpf/test_progs.c:1697
    #11 0xffff9676d214 in __libc_start_main ../csu/libc-start.c:308
    #12 0xaaaab5d65990  (test_progs+0x185990)

previously allocated by thread T0 here:
    #0 0xaaaab5e2c7c4 in realloc (test_progs+0x24c7c4)
    #1 0xaaaab634f4a0 in libbpf_reallocarray tools/lib/bpf/libbpf_internal.h:191
    #2 0xaaaab634f840 in libbpf_add_mem tools/lib/bpf/btf.c:163
    #3 0xaaaab636643c in strset_add_str_mem tools/lib/bpf/strset.c:106
    #4 0xaaaab6366560 in strset__add_str tools/lib/bpf/strset.c:157
    #5 0xaaaab6352d70 in btf__add_str tools/lib/bpf/btf.c:1519
    #6 0xaaaab6353ff0 in btf_add_enum_common tools/lib/bpf/btf.c:2070
    #7 0xaaaab6354080 in btf__add_enum tools/lib/bpf/btf.c:2102
    #8 0xaaaab6082f50 in test_btf_dump_incremental tools/testing/selftests/bpf/prog_tests/btf_dump.c:162
    #9 0xaaaab6097530 in test_btf_dump tools/testing/selftests/bpf/prog_tests/btf_dump.c:875
    #10 0xaaaab6314ed0 in run_one_test tools/testing/selftests/bpf/test_progs.c:1062
    #11 0xaaaab631a0a8 in main tools/testing/selftests/bpf/test_progs.c:1697
    #12 0xffff9676d214 in __libc_start_main ../csu/libc-start.c:308
    #13 0xaaaab5d65990  (test_progs+0x185990)

The reason is that the key stored in hash table name_map is a string
address, and the string memory is allocated by realloc() function, when
the memory is resized by realloc() later, the old memory may be freed,
so the address stored in name_map references to a freed memory, causing
use-after-free.

Fix it by storing duplicated string address in name_map.

Fixes: 919d2b1 ("libbpf: Allow modification of BTF and add btf__add_str API")
Signed-off-by: Xu Kuohai <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Oct 11, 2022
ASAN reports an use-after-free in btf_dump_name_dups:

ERROR: AddressSanitizer: heap-use-after-free on address 0xffff927006db at pc 0xaaaab5dfb618 bp 0xffffdd89b890 sp 0xffffdd89b928
READ of size 2 at 0xffff927006db thread T0
    #0 0xaaaab5dfb614 in __interceptor_strcmp.part.0 (test_progs+0x21b614)
    #1 0xaaaab635f144 in str_equal_fn tools/lib/bpf/btf_dump.c:127
    #2 0xaaaab635e3e0 in hashmap_find_entry tools/lib/bpf/hashmap.c:143
    #3 0xaaaab635e72c in hashmap__find tools/lib/bpf/hashmap.c:212
    #4 0xaaaab6362258 in btf_dump_name_dups tools/lib/bpf/btf_dump.c:1525
    #5 0xaaaab636240c in btf_dump_resolve_name tools/lib/bpf/btf_dump.c:1552
    #6 0xaaaab6362598 in btf_dump_type_name tools/lib/bpf/btf_dump.c:1567
    #7 0xaaaab6360b48 in btf_dump_emit_struct_def tools/lib/bpf/btf_dump.c:912
    #8 0xaaaab6360630 in btf_dump_emit_type tools/lib/bpf/btf_dump.c:798
    #9 0xaaaab635f720 in btf_dump__dump_type tools/lib/bpf/btf_dump.c:282
    #10 0xaaaab608523c in test_btf_dump_incremental tools/testing/selftests/bpf/prog_tests/btf_dump.c:236
    #11 0xaaaab6097530 in test_btf_dump tools/testing/selftests/bpf/prog_tests/btf_dump.c:875
    #12 0xaaaab6314ed0 in run_one_test tools/testing/selftests/bpf/test_progs.c:1062
    #13 0xaaaab631a0a8 in main tools/testing/selftests/bpf/test_progs.c:1697
    #14 0xffff9676d214 in __libc_start_main ../csu/libc-start.c:308
    #15 0xaaaab5d65990  (test_progs+0x185990)

0xffff927006db is located 11 bytes inside of 16-byte region [0xffff927006d0,0xffff927006e0)
freed by thread T0 here:
    #0 0xaaaab5e2c7c4 in realloc (test_progs+0x24c7c4)
    #1 0xaaaab634f4a0 in libbpf_reallocarray tools/lib/bpf/libbpf_internal.h:191
    #2 0xaaaab634f840 in libbpf_add_mem tools/lib/bpf/btf.c:163
    #3 0xaaaab636643c in strset_add_str_mem tools/lib/bpf/strset.c:106
    #4 0xaaaab6366560 in strset__add_str tools/lib/bpf/strset.c:157
    #5 0xaaaab6352d70 in btf__add_str tools/lib/bpf/btf.c:1519
    #6 0xaaaab6353e10 in btf__add_field tools/lib/bpf/btf.c:2032
    #7 0xaaaab6084fcc in test_btf_dump_incremental tools/testing/selftests/bpf/prog_tests/btf_dump.c:232
    #8 0xaaaab6097530 in test_btf_dump tools/testing/selftests/bpf/prog_tests/btf_dump.c:875
    #9 0xaaaab6314ed0 in run_one_test tools/testing/selftests/bpf/test_progs.c:1062
    #10 0xaaaab631a0a8 in main tools/testing/selftests/bpf/test_progs.c:1697
    #11 0xffff9676d214 in __libc_start_main ../csu/libc-start.c:308
    #12 0xaaaab5d65990  (test_progs+0x185990)

previously allocated by thread T0 here:
    #0 0xaaaab5e2c7c4 in realloc (test_progs+0x24c7c4)
    #1 0xaaaab634f4a0 in libbpf_reallocarray tools/lib/bpf/libbpf_internal.h:191
    #2 0xaaaab634f840 in libbpf_add_mem tools/lib/bpf/btf.c:163
    #3 0xaaaab636643c in strset_add_str_mem tools/lib/bpf/strset.c:106
    #4 0xaaaab6366560 in strset__add_str tools/lib/bpf/strset.c:157
    #5 0xaaaab6352d70 in btf__add_str tools/lib/bpf/btf.c:1519
    #6 0xaaaab6353ff0 in btf_add_enum_common tools/lib/bpf/btf.c:2070
    #7 0xaaaab6354080 in btf__add_enum tools/lib/bpf/btf.c:2102
    #8 0xaaaab6082f50 in test_btf_dump_incremental tools/testing/selftests/bpf/prog_tests/btf_dump.c:162
    #9 0xaaaab6097530 in test_btf_dump tools/testing/selftests/bpf/prog_tests/btf_dump.c:875
    #10 0xaaaab6314ed0 in run_one_test tools/testing/selftests/bpf/test_progs.c:1062
    #11 0xaaaab631a0a8 in main tools/testing/selftests/bpf/test_progs.c:1697
    #12 0xffff9676d214 in __libc_start_main ../csu/libc-start.c:308
    #13 0xaaaab5d65990  (test_progs+0x185990)

The reason is that the key stored in hash table name_map is a string
address, and the string memory is allocated by realloc() function, when
the memory is resized by realloc() later, the old memory may be freed,
so the address stored in name_map references to a freed memory, causing
use-after-free.

Fix it by storing duplicated string address in name_map.

Fixes: 919d2b1 ("libbpf: Allow modification of BTF and add btf__add_str API")
Signed-off-by: Xu Kuohai <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Oct 12, 2022
ASAN reports an use-after-free in btf_dump_name_dups:

ERROR: AddressSanitizer: heap-use-after-free on address 0xffff927006db at pc 0xaaaab5dfb618 bp 0xffffdd89b890 sp 0xffffdd89b928
READ of size 2 at 0xffff927006db thread T0
    #0 0xaaaab5dfb614 in __interceptor_strcmp.part.0 (test_progs+0x21b614)
    #1 0xaaaab635f144 in str_equal_fn tools/lib/bpf/btf_dump.c:127
    #2 0xaaaab635e3e0 in hashmap_find_entry tools/lib/bpf/hashmap.c:143
    #3 0xaaaab635e72c in hashmap__find tools/lib/bpf/hashmap.c:212
    #4 0xaaaab6362258 in btf_dump_name_dups tools/lib/bpf/btf_dump.c:1525
    #5 0xaaaab636240c in btf_dump_resolve_name tools/lib/bpf/btf_dump.c:1552
    #6 0xaaaab6362598 in btf_dump_type_name tools/lib/bpf/btf_dump.c:1567
    #7 0xaaaab6360b48 in btf_dump_emit_struct_def tools/lib/bpf/btf_dump.c:912
    #8 0xaaaab6360630 in btf_dump_emit_type tools/lib/bpf/btf_dump.c:798
    #9 0xaaaab635f720 in btf_dump__dump_type tools/lib/bpf/btf_dump.c:282
    #10 0xaaaab608523c in test_btf_dump_incremental tools/testing/selftests/bpf/prog_tests/btf_dump.c:236
    #11 0xaaaab6097530 in test_btf_dump tools/testing/selftests/bpf/prog_tests/btf_dump.c:875
    #12 0xaaaab6314ed0 in run_one_test tools/testing/selftests/bpf/test_progs.c:1062
    #13 0xaaaab631a0a8 in main tools/testing/selftests/bpf/test_progs.c:1697
    #14 0xffff9676d214 in __libc_start_main ../csu/libc-start.c:308
    #15 0xaaaab5d65990  (test_progs+0x185990)

0xffff927006db is located 11 bytes inside of 16-byte region [0xffff927006d0,0xffff927006e0)
freed by thread T0 here:
    #0 0xaaaab5e2c7c4 in realloc (test_progs+0x24c7c4)
    #1 0xaaaab634f4a0 in libbpf_reallocarray tools/lib/bpf/libbpf_internal.h:191
    #2 0xaaaab634f840 in libbpf_add_mem tools/lib/bpf/btf.c:163
    #3 0xaaaab636643c in strset_add_str_mem tools/lib/bpf/strset.c:106
    #4 0xaaaab6366560 in strset__add_str tools/lib/bpf/strset.c:157
    #5 0xaaaab6352d70 in btf__add_str tools/lib/bpf/btf.c:1519
    #6 0xaaaab6353e10 in btf__add_field tools/lib/bpf/btf.c:2032
    #7 0xaaaab6084fcc in test_btf_dump_incremental tools/testing/selftests/bpf/prog_tests/btf_dump.c:232
    #8 0xaaaab6097530 in test_btf_dump tools/testing/selftests/bpf/prog_tests/btf_dump.c:875
    #9 0xaaaab6314ed0 in run_one_test tools/testing/selftests/bpf/test_progs.c:1062
    #10 0xaaaab631a0a8 in main tools/testing/selftests/bpf/test_progs.c:1697
    #11 0xffff9676d214 in __libc_start_main ../csu/libc-start.c:308
    #12 0xaaaab5d65990  (test_progs+0x185990)

previously allocated by thread T0 here:
    #0 0xaaaab5e2c7c4 in realloc (test_progs+0x24c7c4)
    #1 0xaaaab634f4a0 in libbpf_reallocarray tools/lib/bpf/libbpf_internal.h:191
    #2 0xaaaab634f840 in libbpf_add_mem tools/lib/bpf/btf.c:163
    #3 0xaaaab636643c in strset_add_str_mem tools/lib/bpf/strset.c:106
    #4 0xaaaab6366560 in strset__add_str tools/lib/bpf/strset.c:157
    #5 0xaaaab6352d70 in btf__add_str tools/lib/bpf/btf.c:1519
    #6 0xaaaab6353ff0 in btf_add_enum_common tools/lib/bpf/btf.c:2070
    #7 0xaaaab6354080 in btf__add_enum tools/lib/bpf/btf.c:2102
    #8 0xaaaab6082f50 in test_btf_dump_incremental tools/testing/selftests/bpf/prog_tests/btf_dump.c:162
    #9 0xaaaab6097530 in test_btf_dump tools/testing/selftests/bpf/prog_tests/btf_dump.c:875
    #10 0xaaaab6314ed0 in run_one_test tools/testing/selftests/bpf/test_progs.c:1062
    #11 0xaaaab631a0a8 in main tools/testing/selftests/bpf/test_progs.c:1697
    #12 0xffff9676d214 in __libc_start_main ../csu/libc-start.c:308
    #13 0xaaaab5d65990  (test_progs+0x185990)

The reason is that the key stored in hash table name_map is a string
address, and the string memory is allocated by realloc() function, when
the memory is resized by realloc() later, the old memory may be freed,
so the address stored in name_map references to a freed memory, causing
use-after-free.

Fix it by storing duplicated string address in name_map.

Fixes: 919d2b1 ("libbpf: Allow modification of BTF and add btf__add_str API")
Signed-off-by: Xu Kuohai <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Oct 13, 2022
ASAN reports an use-after-free in btf_dump_name_dups:

ERROR: AddressSanitizer: heap-use-after-free on address 0xffff927006db at pc 0xaaaab5dfb618 bp 0xffffdd89b890 sp 0xffffdd89b928
READ of size 2 at 0xffff927006db thread T0
    #0 0xaaaab5dfb614 in __interceptor_strcmp.part.0 (test_progs+0x21b614)
    #1 0xaaaab635f144 in str_equal_fn tools/lib/bpf/btf_dump.c:127
    #2 0xaaaab635e3e0 in hashmap_find_entry tools/lib/bpf/hashmap.c:143
    #3 0xaaaab635e72c in hashmap__find tools/lib/bpf/hashmap.c:212
    #4 0xaaaab6362258 in btf_dump_name_dups tools/lib/bpf/btf_dump.c:1525
    #5 0xaaaab636240c in btf_dump_resolve_name tools/lib/bpf/btf_dump.c:1552
    #6 0xaaaab6362598 in btf_dump_type_name tools/lib/bpf/btf_dump.c:1567
    #7 0xaaaab6360b48 in btf_dump_emit_struct_def tools/lib/bpf/btf_dump.c:912
    #8 0xaaaab6360630 in btf_dump_emit_type tools/lib/bpf/btf_dump.c:798
    #9 0xaaaab635f720 in btf_dump__dump_type tools/lib/bpf/btf_dump.c:282
    #10 0xaaaab608523c in test_btf_dump_incremental tools/testing/selftests/bpf/prog_tests/btf_dump.c:236
    #11 0xaaaab6097530 in test_btf_dump tools/testing/selftests/bpf/prog_tests/btf_dump.c:875
    #12 0xaaaab6314ed0 in run_one_test tools/testing/selftests/bpf/test_progs.c:1062
    #13 0xaaaab631a0a8 in main tools/testing/selftests/bpf/test_progs.c:1697
    #14 0xffff9676d214 in __libc_start_main ../csu/libc-start.c:308
    #15 0xaaaab5d65990  (test_progs+0x185990)

0xffff927006db is located 11 bytes inside of 16-byte region [0xffff927006d0,0xffff927006e0)
freed by thread T0 here:
    #0 0xaaaab5e2c7c4 in realloc (test_progs+0x24c7c4)
    #1 0xaaaab634f4a0 in libbpf_reallocarray tools/lib/bpf/libbpf_internal.h:191
    #2 0xaaaab634f840 in libbpf_add_mem tools/lib/bpf/btf.c:163
    #3 0xaaaab636643c in strset_add_str_mem tools/lib/bpf/strset.c:106
    #4 0xaaaab6366560 in strset__add_str tools/lib/bpf/strset.c:157
    #5 0xaaaab6352d70 in btf__add_str tools/lib/bpf/btf.c:1519
    #6 0xaaaab6353e10 in btf__add_field tools/lib/bpf/btf.c:2032
    #7 0xaaaab6084fcc in test_btf_dump_incremental tools/testing/selftests/bpf/prog_tests/btf_dump.c:232
    #8 0xaaaab6097530 in test_btf_dump tools/testing/selftests/bpf/prog_tests/btf_dump.c:875
    #9 0xaaaab6314ed0 in run_one_test tools/testing/selftests/bpf/test_progs.c:1062
    #10 0xaaaab631a0a8 in main tools/testing/selftests/bpf/test_progs.c:1697
    #11 0xffff9676d214 in __libc_start_main ../csu/libc-start.c:308
    #12 0xaaaab5d65990  (test_progs+0x185990)

previously allocated by thread T0 here:
    #0 0xaaaab5e2c7c4 in realloc (test_progs+0x24c7c4)
    #1 0xaaaab634f4a0 in libbpf_reallocarray tools/lib/bpf/libbpf_internal.h:191
    #2 0xaaaab634f840 in libbpf_add_mem tools/lib/bpf/btf.c:163
    #3 0xaaaab636643c in strset_add_str_mem tools/lib/bpf/strset.c:106
    #4 0xaaaab6366560 in strset__add_str tools/lib/bpf/strset.c:157
    #5 0xaaaab6352d70 in btf__add_str tools/lib/bpf/btf.c:1519
    #6 0xaaaab6353ff0 in btf_add_enum_common tools/lib/bpf/btf.c:2070
    #7 0xaaaab6354080 in btf__add_enum tools/lib/bpf/btf.c:2102
    #8 0xaaaab6082f50 in test_btf_dump_incremental tools/testing/selftests/bpf/prog_tests/btf_dump.c:162
    #9 0xaaaab6097530 in test_btf_dump tools/testing/selftests/bpf/prog_tests/btf_dump.c:875
    #10 0xaaaab6314ed0 in run_one_test tools/testing/selftests/bpf/test_progs.c:1062
    #11 0xaaaab631a0a8 in main tools/testing/selftests/bpf/test_progs.c:1697
    #12 0xffff9676d214 in __libc_start_main ../csu/libc-start.c:308
    #13 0xaaaab5d65990  (test_progs+0x185990)

The reason is that the key stored in hash table name_map is a string
address, and the string memory is allocated by realloc() function, when
the memory is resized by realloc() later, the old memory may be freed,
so the address stored in name_map references to a freed memory, causing
use-after-free.

Fix it by storing duplicated string address in name_map.

Fixes: 919d2b1 ("libbpf: Allow modification of BTF and add btf__add_str API")
Signed-off-by: Xu Kuohai <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Jan 14, 2025
Under PREEMPT_RT, it is not safe to use GPF_ATOMIC kmalloc when
preemption or irq is disabled. The following warning is reported when
running test_progs under PREEMPT_RT:

  BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
  in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 675, name: test_progs
  preempt_count: 1, expected: 0
  RCU nest depth: 0, expected: 0
  2 locks held by test_progs/675:
   #0: ffffffff864b0240 (rcu_read_lock_trace){....}-{0:0}, at: bpf_prog_test_run_syscall+0x2c0/0x830
   #1: ffff8881f4ec40c8 ((&c->lock)){....}-{2:2}, at: ___slab_alloc+0xbc/0x1280
  Preemption disabled at:
  [<ffffffff8175ae2b>] __bpf_async_init+0xbb/0xb10
  CPU: 1 UID: 0 PID: 675 Comm: test_progs Tainted: G           O       6.12.0+ #11
  Tainted: [O]=OOT_MODULE
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ...
  Call Trace:
   <TASK>
   dump_stack_lvl+0x57/0x70
   dump_stack+0x10/0x20
   __might_resched+0x337/0x4d0
   rt_spin_lock+0xd4/0x230
   ___slab_alloc+0xbc/0x1280
   __slab_alloc.isra.0+0x5d/0xa0
   __kmalloc_node_noprof+0xf7/0x4f0
   bpf_map_kmalloc_node+0xf5/0x6b0
   __bpf_async_init+0x20e/0xb10
   bpf_timer_init+0x30/0x40
   bpf_prog_c7e2dc9ff3d5ba62_start_cb+0x55/0x85
   bpf_prog_4eb421be69ae82fa_start_timer+0x5d/0x7e
   bpf_prog_test_run_syscall+0x322/0x830
   __sys_bpf+0x135d/0x3ca0
   __x64_sys_bpf+0x75/0xb0
   x64_sys_call+0x1b5/0xa10
   do_syscall_64+0x3b/0xc0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53

Fix the problem by using bpf_global_ma to allocate bpf_async_cb when
PREEMPT_RT is enabled. The reason for still using kmalloc for
no-PREEMPT_RT case is that bpf_global_ma doesn't support accouting the
allocated memory to specific memcg. Also doing the memory allocation
before invoking __bpf_spin_lock_irqsave() to reduce the possibility of
-ENOMEM for bpf_global_ma.

Signed-off-by: Hou Tao <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Jan 14, 2025
During the update procedure, when overwrite element in a pre-allocated
htab, the freeing of old_element is protected by the bucket lock. The
reason why the bucket lock is necessary is that the old_element has
already been stashed in htab->extra_elems after alloc_htab_elem()
returns. If freeing the old_element after the bucket lock is unlocked,
the stashed element may be reused by concurrent update procedure and the
freeing of old_element will run concurrently with the reuse of the
old_element. However, the invocation of check_and_free_fields() may
acquire a spin-lock which violates the lockdep rule because its caller
has already held a raw-spin-lock (bucket lock). The following warning
will be reported when such race happens:

  BUG: scheduling while atomic: test_progs/676/0x00000003
  3 locks held by test_progs/676:
  #0: ffffffff864b0240 (rcu_read_lock_trace){....}-{0:0}, at: bpf_prog_test_run_syscall+0x2c0/0x830
  #1: ffff88810e961188 (&htab->lockdep_key){....}-{2:2}, at: htab_map_update_elem+0x306/0x1500
  #2: ffff8881f4eac1b8 (&base->softirq_expiry_lock){....}-{2:2}, at: hrtimer_cancel_wait_running+0xe9/0x1b0
  Modules linked in: bpf_testmod(O)
  Preemption disabled at:
  [<ffffffff817837a3>] htab_map_update_elem+0x293/0x1500
  CPU: 0 UID: 0 PID: 676 Comm: test_progs Tainted: G ... 6.12.0+ #11
  Tainted: [W]=WARN, [O]=OOT_MODULE
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)...
  Call Trace:
  <TASK>
  dump_stack_lvl+0x57/0x70
  dump_stack+0x10/0x20
  __schedule_bug+0x120/0x170
  __schedule+0x300c/0x4800
  schedule_rtlock+0x37/0x60
  rtlock_slowlock_locked+0x6d9/0x54c0
  rt_spin_lock+0x168/0x230
  hrtimer_cancel_wait_running+0xe9/0x1b0
  hrtimer_cancel+0x24/0x30
  bpf_timer_delete_work+0x1d/0x40
  bpf_timer_cancel_and_free+0x5e/0x80
  bpf_obj_free_fields+0x262/0x4a0
  check_and_free_fields+0x1d0/0x280
  htab_map_update_elem+0x7fc/0x1500
  bpf_prog_9f90bc20768e0cb9_overwrite_cb+0x3f/0x43
  bpf_prog_ea601c4649694dbd_overwrite_timer+0x5d/0x7e
  bpf_prog_test_run_syscall+0x322/0x830
  __sys_bpf+0x135d/0x3ca0
  __x64_sys_bpf+0x75/0xb0
  x64_sys_call+0x1b5/0xa10
  do_syscall_64+0x3b/0xc0
  entry_SYSCALL_64_after_hwframe+0x4b/0x53
  ...
  </TASK>

It seems feasible to break the reuse and refill of per-cpu extra_elems
into two independent parts: reuse the per-cpu extra_elems with bucket
lock being held and refill the old_element as per-cpu extra_elems after
the bucket lock is unlocked. However, it will make the concurrent
overwrite procedures on the same CPU return unexpected -E2BIG error when
the map is full.

Therefore, the patch fixes the lock problem by breaking the cancelling
of bpf_timer into two steps:
1) use hrtimer_try_to_cancel() and check its return value
2) if the timer is running, use hrtimer_cancel() through a kworker to
   cancel it again
Considering that the current implementation of hrtimer_cancel() will try
to spin on current CPU or acquire a being held softirq_expiry_lock
when the current timer is running, these steps above are reasonable.
However, it also has downside. When the timer is running, the cancelling
of the timer is delayed when releasing the last map uref. The delay is
also fixable (e.g., break the cancelling of bpf timer into two parts:
one part in locked scope, another one in unlocked scope), so it can be
revised later if necessary.

It is a bit hard to decide the right fix tag. One reason is that the
problem depends on PREEMPT_RT which is enabled in v6.12. Considering the
softirq_expiry_lock lock exists since v5.4 and bpf_timer is introduced
in v5.15, the bpf_timer commit is used in the fixes tag and an extra
depends-on tag is added to state the dependency on PREEMPT_RT.

Fixes: b00628b ("bpf: Introduce bpf timers.")
Depends-on: v6.12 with PREEMPT_RT enabled
Reported-by: Sebastian Andrzej Siewior <[email protected]>
Closes: https://lore.kernel.org/bpf/[email protected]
Signed-off-by: Hou Tao <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Jan 14, 2025
Under PREEMPT_RT, it is not safe to use GPF_ATOMIC kmalloc when
preemption or irq is disabled. The following warning is reported when
running test_progs under PREEMPT_RT:

  BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
  in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 675, name: test_progs
  preempt_count: 1, expected: 0
  RCU nest depth: 0, expected: 0
  2 locks held by test_progs/675:
   #0: ffffffff864b0240 (rcu_read_lock_trace){....}-{0:0}, at: bpf_prog_test_run_syscall+0x2c0/0x830
   #1: ffff8881f4ec40c8 ((&c->lock)){....}-{2:2}, at: ___slab_alloc+0xbc/0x1280
  Preemption disabled at:
  [<ffffffff8175ae2b>] __bpf_async_init+0xbb/0xb10
  CPU: 1 UID: 0 PID: 675 Comm: test_progs Tainted: G           O       6.12.0+ #11
  Tainted: [O]=OOT_MODULE
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ...
  Call Trace:
   <TASK>
   dump_stack_lvl+0x57/0x70
   dump_stack+0x10/0x20
   __might_resched+0x337/0x4d0
   rt_spin_lock+0xd4/0x230
   ___slab_alloc+0xbc/0x1280
   __slab_alloc.isra.0+0x5d/0xa0
   __kmalloc_node_noprof+0xf7/0x4f0
   bpf_map_kmalloc_node+0xf5/0x6b0
   __bpf_async_init+0x20e/0xb10
   bpf_timer_init+0x30/0x40
   bpf_prog_c7e2dc9ff3d5ba62_start_cb+0x55/0x85
   bpf_prog_4eb421be69ae82fa_start_timer+0x5d/0x7e
   bpf_prog_test_run_syscall+0x322/0x830
   __sys_bpf+0x135d/0x3ca0
   __x64_sys_bpf+0x75/0xb0
   x64_sys_call+0x1b5/0xa10
   do_syscall_64+0x3b/0xc0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53

Fix the problem by using bpf_global_ma to allocate bpf_async_cb when
PREEMPT_RT is enabled. The reason for still using kmalloc for
no-PREEMPT_RT case is that bpf_global_ma doesn't support accouting the
allocated memory to specific memcg. Also doing the memory allocation
before invoking __bpf_spin_lock_irqsave() to reduce the possibility of
-ENOMEM for bpf_global_ma.

Signed-off-by: Hou Tao <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Jan 15, 2025
Under PREEMPT_RT, it is not safe to use GPF_ATOMIC kmalloc when
preemption or irq is disabled. The following warning is reported when
running test_progs under PREEMPT_RT:

  BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
  in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 675, name: test_progs
  preempt_count: 1, expected: 0
  RCU nest depth: 0, expected: 0
  2 locks held by test_progs/675:
   #0: ffffffff864b0240 (rcu_read_lock_trace){....}-{0:0}, at: bpf_prog_test_run_syscall+0x2c0/0x830
   #1: ffff8881f4ec40c8 ((&c->lock)){....}-{2:2}, at: ___slab_alloc+0xbc/0x1280
  Preemption disabled at:
  [<ffffffff8175ae2b>] __bpf_async_init+0xbb/0xb10
  CPU: 1 UID: 0 PID: 675 Comm: test_progs Tainted: G           O       6.12.0+ #11
  Tainted: [O]=OOT_MODULE
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ...
  Call Trace:
   <TASK>
   dump_stack_lvl+0x57/0x70
   dump_stack+0x10/0x20
   __might_resched+0x337/0x4d0
   rt_spin_lock+0xd4/0x230
   ___slab_alloc+0xbc/0x1280
   __slab_alloc.isra.0+0x5d/0xa0
   __kmalloc_node_noprof+0xf7/0x4f0
   bpf_map_kmalloc_node+0xf5/0x6b0
   __bpf_async_init+0x20e/0xb10
   bpf_timer_init+0x30/0x40
   bpf_prog_c7e2dc9ff3d5ba62_start_cb+0x55/0x85
   bpf_prog_4eb421be69ae82fa_start_timer+0x5d/0x7e
   bpf_prog_test_run_syscall+0x322/0x830
   __sys_bpf+0x135d/0x3ca0
   __x64_sys_bpf+0x75/0xb0
   x64_sys_call+0x1b5/0xa10
   do_syscall_64+0x3b/0xc0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53

Fix the problem by using bpf_global_ma to allocate bpf_async_cb when
PREEMPT_RT is enabled. The reason for still using kmalloc for
no-PREEMPT_RT case is that bpf_global_ma doesn't support accouting the
allocated memory to specific memcg. Also doing the memory allocation
before invoking __bpf_spin_lock_irqsave() to reduce the possibility of
-ENOMEM for bpf_global_ma.

Signed-off-by: Hou Tao <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Jan 15, 2025
Under PREEMPT_RT, it is not safe to use GPF_ATOMIC kmalloc when
preemption or irq is disabled. The following warning is reported when
running test_progs under PREEMPT_RT:

  BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
  in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 675, name: test_progs
  preempt_count: 1, expected: 0
  RCU nest depth: 0, expected: 0
  2 locks held by test_progs/675:
   #0: ffffffff864b0240 (rcu_read_lock_trace){....}-{0:0}, at: bpf_prog_test_run_syscall+0x2c0/0x830
   #1: ffff8881f4ec40c8 ((&c->lock)){....}-{2:2}, at: ___slab_alloc+0xbc/0x1280
  Preemption disabled at:
  [<ffffffff8175ae2b>] __bpf_async_init+0xbb/0xb10
  CPU: 1 UID: 0 PID: 675 Comm: test_progs Tainted: G           O       6.12.0+ #11
  Tainted: [O]=OOT_MODULE
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ...
  Call Trace:
   <TASK>
   dump_stack_lvl+0x57/0x70
   dump_stack+0x10/0x20
   __might_resched+0x337/0x4d0
   rt_spin_lock+0xd4/0x230
   ___slab_alloc+0xbc/0x1280
   __slab_alloc.isra.0+0x5d/0xa0
   __kmalloc_node_noprof+0xf7/0x4f0
   bpf_map_kmalloc_node+0xf5/0x6b0
   __bpf_async_init+0x20e/0xb10
   bpf_timer_init+0x30/0x40
   bpf_prog_c7e2dc9ff3d5ba62_start_cb+0x55/0x85
   bpf_prog_4eb421be69ae82fa_start_timer+0x5d/0x7e
   bpf_prog_test_run_syscall+0x322/0x830
   __sys_bpf+0x135d/0x3ca0
   __x64_sys_bpf+0x75/0xb0
   x64_sys_call+0x1b5/0xa10
   do_syscall_64+0x3b/0xc0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53

Fix the problem by using bpf_global_ma to allocate bpf_async_cb when
PREEMPT_RT is enabled. The reason for still using kmalloc for
no-PREEMPT_RT case is that bpf_global_ma doesn't support accouting the
allocated memory to specific memcg. Also doing the memory allocation
before invoking __bpf_spin_lock_irqsave() to reduce the possibility of
-ENOMEM for bpf_global_ma.

Signed-off-by: Hou Tao <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Jan 16, 2025
Under PREEMPT_RT, it is not safe to use GPF_ATOMIC kmalloc when
preemption or irq is disabled. The following warning is reported when
running test_progs under PREEMPT_RT:

  BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
  in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 675, name: test_progs
  preempt_count: 1, expected: 0
  RCU nest depth: 0, expected: 0
  2 locks held by test_progs/675:
   #0: ffffffff864b0240 (rcu_read_lock_trace){....}-{0:0}, at: bpf_prog_test_run_syscall+0x2c0/0x830
   #1: ffff8881f4ec40c8 ((&c->lock)){....}-{2:2}, at: ___slab_alloc+0xbc/0x1280
  Preemption disabled at:
  [<ffffffff8175ae2b>] __bpf_async_init+0xbb/0xb10
  CPU: 1 UID: 0 PID: 675 Comm: test_progs Tainted: G           O       6.12.0+ #11
  Tainted: [O]=OOT_MODULE
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ...
  Call Trace:
   <TASK>
   dump_stack_lvl+0x57/0x70
   dump_stack+0x10/0x20
   __might_resched+0x337/0x4d0
   rt_spin_lock+0xd4/0x230
   ___slab_alloc+0xbc/0x1280
   __slab_alloc.isra.0+0x5d/0xa0
   __kmalloc_node_noprof+0xf7/0x4f0
   bpf_map_kmalloc_node+0xf5/0x6b0
   __bpf_async_init+0x20e/0xb10
   bpf_timer_init+0x30/0x40
   bpf_prog_c7e2dc9ff3d5ba62_start_cb+0x55/0x85
   bpf_prog_4eb421be69ae82fa_start_timer+0x5d/0x7e
   bpf_prog_test_run_syscall+0x322/0x830
   __sys_bpf+0x135d/0x3ca0
   __x64_sys_bpf+0x75/0xb0
   x64_sys_call+0x1b5/0xa10
   do_syscall_64+0x3b/0xc0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53

Fix the problem by using bpf_global_ma to allocate bpf_async_cb when
PREEMPT_RT is enabled. The reason for still using kmalloc for
no-PREEMPT_RT case is that bpf_global_ma doesn't support accouting the
allocated memory to specific memcg. Also doing the memory allocation
before invoking __bpf_spin_lock_irqsave() to reduce the possibility of
-ENOMEM for bpf_global_ma.

Signed-off-by: Hou Tao <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Jan 16, 2025
Under PREEMPT_RT, it is not safe to use GPF_ATOMIC kmalloc when
preemption or irq is disabled. The following warning is reported when
running test_progs under PREEMPT_RT:

  BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
  in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 675, name: test_progs
  preempt_count: 1, expected: 0
  RCU nest depth: 0, expected: 0
  2 locks held by test_progs/675:
   #0: ffffffff864b0240 (rcu_read_lock_trace){....}-{0:0}, at: bpf_prog_test_run_syscall+0x2c0/0x830
   #1: ffff8881f4ec40c8 ((&c->lock)){....}-{2:2}, at: ___slab_alloc+0xbc/0x1280
  Preemption disabled at:
  [<ffffffff8175ae2b>] __bpf_async_init+0xbb/0xb10
  CPU: 1 UID: 0 PID: 675 Comm: test_progs Tainted: G           O       6.12.0+ #11
  Tainted: [O]=OOT_MODULE
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ...
  Call Trace:
   <TASK>
   dump_stack_lvl+0x57/0x70
   dump_stack+0x10/0x20
   __might_resched+0x337/0x4d0
   rt_spin_lock+0xd4/0x230
   ___slab_alloc+0xbc/0x1280
   __slab_alloc.isra.0+0x5d/0xa0
   __kmalloc_node_noprof+0xf7/0x4f0
   bpf_map_kmalloc_node+0xf5/0x6b0
   __bpf_async_init+0x20e/0xb10
   bpf_timer_init+0x30/0x40
   bpf_prog_c7e2dc9ff3d5ba62_start_cb+0x55/0x85
   bpf_prog_4eb421be69ae82fa_start_timer+0x5d/0x7e
   bpf_prog_test_run_syscall+0x322/0x830
   __sys_bpf+0x135d/0x3ca0
   __x64_sys_bpf+0x75/0xb0
   x64_sys_call+0x1b5/0xa10
   do_syscall_64+0x3b/0xc0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53

Fix the problem by using bpf_global_ma to allocate bpf_async_cb when
PREEMPT_RT is enabled. The reason for still using kmalloc for
no-PREEMPT_RT case is that bpf_global_ma doesn't support accouting the
allocated memory to specific memcg. Also doing the memory allocation
before invoking __bpf_spin_lock_irqsave() to reduce the possibility of
-ENOMEM for bpf_global_ma.

Signed-off-by: Hou Tao <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Jan 16, 2025
Under PREEMPT_RT, it is not safe to use GPF_ATOMIC kmalloc when
preemption or irq is disabled. The following warning is reported when
running test_progs under PREEMPT_RT:

  BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
  in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 675, name: test_progs
  preempt_count: 1, expected: 0
  RCU nest depth: 0, expected: 0
  2 locks held by test_progs/675:
   #0: ffffffff864b0240 (rcu_read_lock_trace){....}-{0:0}, at: bpf_prog_test_run_syscall+0x2c0/0x830
   #1: ffff8881f4ec40c8 ((&c->lock)){....}-{2:2}, at: ___slab_alloc+0xbc/0x1280
  Preemption disabled at:
  [<ffffffff8175ae2b>] __bpf_async_init+0xbb/0xb10
  CPU: 1 UID: 0 PID: 675 Comm: test_progs Tainted: G           O       6.12.0+ #11
  Tainted: [O]=OOT_MODULE
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ...
  Call Trace:
   <TASK>
   dump_stack_lvl+0x57/0x70
   dump_stack+0x10/0x20
   __might_resched+0x337/0x4d0
   rt_spin_lock+0xd4/0x230
   ___slab_alloc+0xbc/0x1280
   __slab_alloc.isra.0+0x5d/0xa0
   __kmalloc_node_noprof+0xf7/0x4f0
   bpf_map_kmalloc_node+0xf5/0x6b0
   __bpf_async_init+0x20e/0xb10
   bpf_timer_init+0x30/0x40
   bpf_prog_c7e2dc9ff3d5ba62_start_cb+0x55/0x85
   bpf_prog_4eb421be69ae82fa_start_timer+0x5d/0x7e
   bpf_prog_test_run_syscall+0x322/0x830
   __sys_bpf+0x135d/0x3ca0
   __x64_sys_bpf+0x75/0xb0
   x64_sys_call+0x1b5/0xa10
   do_syscall_64+0x3b/0xc0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53

Fix the problem by using bpf_global_ma to allocate bpf_async_cb when
PREEMPT_RT is enabled. The reason for still using kmalloc for
no-PREEMPT_RT case is that bpf_global_ma doesn't support accouting the
allocated memory to specific memcg. Also doing the memory allocation
before invoking __bpf_spin_lock_irqsave() to reduce the possibility of
-ENOMEM for bpf_global_ma.

Signed-off-by: Hou Tao <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Jan 16, 2025
Under PREEMPT_RT, it is not safe to use GPF_ATOMIC kmalloc when
preemption or irq is disabled. The following warning is reported when
running test_progs under PREEMPT_RT:

  BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
  in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 675, name: test_progs
  preempt_count: 1, expected: 0
  RCU nest depth: 0, expected: 0
  2 locks held by test_progs/675:
   #0: ffffffff864b0240 (rcu_read_lock_trace){....}-{0:0}, at: bpf_prog_test_run_syscall+0x2c0/0x830
   #1: ffff8881f4ec40c8 ((&c->lock)){....}-{2:2}, at: ___slab_alloc+0xbc/0x1280
  Preemption disabled at:
  [<ffffffff8175ae2b>] __bpf_async_init+0xbb/0xb10
  CPU: 1 UID: 0 PID: 675 Comm: test_progs Tainted: G           O       6.12.0+ #11
  Tainted: [O]=OOT_MODULE
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ...
  Call Trace:
   <TASK>
   dump_stack_lvl+0x57/0x70
   dump_stack+0x10/0x20
   __might_resched+0x337/0x4d0
   rt_spin_lock+0xd4/0x230
   ___slab_alloc+0xbc/0x1280
   __slab_alloc.isra.0+0x5d/0xa0
   __kmalloc_node_noprof+0xf7/0x4f0
   bpf_map_kmalloc_node+0xf5/0x6b0
   __bpf_async_init+0x20e/0xb10
   bpf_timer_init+0x30/0x40
   bpf_prog_c7e2dc9ff3d5ba62_start_cb+0x55/0x85
   bpf_prog_4eb421be69ae82fa_start_timer+0x5d/0x7e
   bpf_prog_test_run_syscall+0x322/0x830
   __sys_bpf+0x135d/0x3ca0
   __x64_sys_bpf+0x75/0xb0
   x64_sys_call+0x1b5/0xa10
   do_syscall_64+0x3b/0xc0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53

Fix the problem by using bpf_global_ma to allocate bpf_async_cb when
PREEMPT_RT is enabled. The reason for still using kmalloc for
no-PREEMPT_RT case is that bpf_global_ma doesn't support accouting the
allocated memory to specific memcg. Also doing the memory allocation
before invoking __bpf_spin_lock_irqsave() to reduce the possibility of
-ENOMEM for bpf_global_ma.

Signed-off-by: Hou Tao <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Jan 17, 2025
Under PREEMPT_RT, it is not safe to use GPF_ATOMIC kmalloc when
preemption or irq is disabled. The following warning is reported when
running test_progs under PREEMPT_RT:

  BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
  in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 675, name: test_progs
  preempt_count: 1, expected: 0
  RCU nest depth: 0, expected: 0
  2 locks held by test_progs/675:
   #0: ffffffff864b0240 (rcu_read_lock_trace){....}-{0:0}, at: bpf_prog_test_run_syscall+0x2c0/0x830
   #1: ffff8881f4ec40c8 ((&c->lock)){....}-{2:2}, at: ___slab_alloc+0xbc/0x1280
  Preemption disabled at:
  [<ffffffff8175ae2b>] __bpf_async_init+0xbb/0xb10
  CPU: 1 UID: 0 PID: 675 Comm: test_progs Tainted: G           O       6.12.0+ #11
  Tainted: [O]=OOT_MODULE
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ...
  Call Trace:
   <TASK>
   dump_stack_lvl+0x57/0x70
   dump_stack+0x10/0x20
   __might_resched+0x337/0x4d0
   rt_spin_lock+0xd4/0x230
   ___slab_alloc+0xbc/0x1280
   __slab_alloc.isra.0+0x5d/0xa0
   __kmalloc_node_noprof+0xf7/0x4f0
   bpf_map_kmalloc_node+0xf5/0x6b0
   __bpf_async_init+0x20e/0xb10
   bpf_timer_init+0x30/0x40
   bpf_prog_c7e2dc9ff3d5ba62_start_cb+0x55/0x85
   bpf_prog_4eb421be69ae82fa_start_timer+0x5d/0x7e
   bpf_prog_test_run_syscall+0x322/0x830
   __sys_bpf+0x135d/0x3ca0
   __x64_sys_bpf+0x75/0xb0
   x64_sys_call+0x1b5/0xa10
   do_syscall_64+0x3b/0xc0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53

Fix the problem by using bpf_global_ma to allocate bpf_async_cb when
PREEMPT_RT is enabled. The reason for still using kmalloc for
no-PREEMPT_RT case is that bpf_global_ma doesn't support accouting the
allocated memory to specific memcg. Also doing the memory allocation
before invoking __bpf_spin_lock_irqsave() to reduce the possibility of
-ENOMEM for bpf_global_ma.

Signed-off-by: Hou Tao <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Jan 17, 2025
Under PREEMPT_RT, it is not safe to use GPF_ATOMIC kmalloc when
preemption or irq is disabled. The following warning is reported when
running test_progs under PREEMPT_RT:

  BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
  in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 675, name: test_progs
  preempt_count: 1, expected: 0
  RCU nest depth: 0, expected: 0
  2 locks held by test_progs/675:
   #0: ffffffff864b0240 (rcu_read_lock_trace){....}-{0:0}, at: bpf_prog_test_run_syscall+0x2c0/0x830
   #1: ffff8881f4ec40c8 ((&c->lock)){....}-{2:2}, at: ___slab_alloc+0xbc/0x1280
  Preemption disabled at:
  [<ffffffff8175ae2b>] __bpf_async_init+0xbb/0xb10
  CPU: 1 UID: 0 PID: 675 Comm: test_progs Tainted: G           O       6.12.0+ #11
  Tainted: [O]=OOT_MODULE
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ...
  Call Trace:
   <TASK>
   dump_stack_lvl+0x57/0x70
   dump_stack+0x10/0x20
   __might_resched+0x337/0x4d0
   rt_spin_lock+0xd4/0x230
   ___slab_alloc+0xbc/0x1280
   __slab_alloc.isra.0+0x5d/0xa0
   __kmalloc_node_noprof+0xf7/0x4f0
   bpf_map_kmalloc_node+0xf5/0x6b0
   __bpf_async_init+0x20e/0xb10
   bpf_timer_init+0x30/0x40
   bpf_prog_c7e2dc9ff3d5ba62_start_cb+0x55/0x85
   bpf_prog_4eb421be69ae82fa_start_timer+0x5d/0x7e
   bpf_prog_test_run_syscall+0x322/0x830
   __sys_bpf+0x135d/0x3ca0
   __x64_sys_bpf+0x75/0xb0
   x64_sys_call+0x1b5/0xa10
   do_syscall_64+0x3b/0xc0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53

Fix the problem by using bpf_global_ma to allocate bpf_async_cb when
PREEMPT_RT is enabled. The reason for still using kmalloc for
no-PREEMPT_RT case is that bpf_global_ma doesn't support accouting the
allocated memory to specific memcg. Also doing the memory allocation
before invoking __bpf_spin_lock_irqsave() to reduce the possibility of
-ENOMEM for bpf_global_ma.

Signed-off-by: Hou Tao <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Jan 17, 2025
During the update procedure, when overwrite element in a pre-allocated
htab, the freeing of old_element is protected by the bucket lock. The
reason why the bucket lock is necessary is that the old_element has
already been stashed in htab->extra_elems after alloc_htab_elem()
returns. If freeing the old_element after the bucket lock is unlocked,
the stashed element may be reused by concurrent update procedure and the
freeing of old_element will run concurrently with the reuse of the
old_element. However, the invocation of check_and_free_fields() may
acquire a spin-lock which violates the lockdep rule because its caller
has already held a raw-spin-lock (bucket lock). The following warning
will be reported when such race happens:

  BUG: scheduling while atomic: test_progs/676/0x00000003
  3 locks held by test_progs/676:
  #0: ffffffff864b0240 (rcu_read_lock_trace){....}-{0:0}, at: bpf_prog_test_run_syscall+0x2c0/0x830
  #1: ffff88810e961188 (&htab->lockdep_key){....}-{2:2}, at: htab_map_update_elem+0x306/0x1500
  #2: ffff8881f4eac1b8 (&base->softirq_expiry_lock){....}-{2:2}, at: hrtimer_cancel_wait_running+0xe9/0x1b0
  Modules linked in: bpf_testmod(O)
  Preemption disabled at:
  [<ffffffff817837a3>] htab_map_update_elem+0x293/0x1500
  CPU: 0 UID: 0 PID: 676 Comm: test_progs Tainted: G ... 6.12.0+ #11
  Tainted: [W]=WARN, [O]=OOT_MODULE
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)...
  Call Trace:
  <TASK>
  dump_stack_lvl+0x57/0x70
  dump_stack+0x10/0x20
  __schedule_bug+0x120/0x170
  __schedule+0x300c/0x4800
  schedule_rtlock+0x37/0x60
  rtlock_slowlock_locked+0x6d9/0x54c0
  rt_spin_lock+0x168/0x230
  hrtimer_cancel_wait_running+0xe9/0x1b0
  hrtimer_cancel+0x24/0x30
  bpf_timer_delete_work+0x1d/0x40
  bpf_timer_cancel_and_free+0x5e/0x80
  bpf_obj_free_fields+0x262/0x4a0
  check_and_free_fields+0x1d0/0x280
  htab_map_update_elem+0x7fc/0x1500
  bpf_prog_9f90bc20768e0cb9_overwrite_cb+0x3f/0x43
  bpf_prog_ea601c4649694dbd_overwrite_timer+0x5d/0x7e
  bpf_prog_test_run_syscall+0x322/0x830
  __sys_bpf+0x135d/0x3ca0
  __x64_sys_bpf+0x75/0xb0
  x64_sys_call+0x1b5/0xa10
  do_syscall_64+0x3b/0xc0
  entry_SYSCALL_64_after_hwframe+0x4b/0x53
  ...
  </TASK>

It seems feasible to break the reuse and refill of per-cpu extra_elems
into two independent parts: reuse the per-cpu extra_elems with bucket
lock being held and refill the old_element as per-cpu extra_elems after
the bucket lock is unlocked. However, it will make the concurrent
overwrite procedures on the same CPU return unexpected -E2BIG error when
the map is full.

Therefore, the patch fixes the lock problem by breaking the cancelling
of bpf_timer into two steps for PREEMPT_RT:
1) use hrtimer_try_to_cancel() and check its return value
2) if the timer is running, use hrtimer_cancel() through a kworker to
   cancel it again
Considering that the current implementation of hrtimer_cancel() will try
to acquire a being held softirq_expiry_lock when the current timer is
running, these steps above are reasonable. However, it also has
downside. When the timer is running, the cancelling of the timer is
delayed when releasing the last map uref. The delay is also fixable
(e.g., break the cancelling of bpf timer into two parts: one part in
locked scope, another one in unlocked scope), it can be revised later if
necessary.

It is a bit hard to decide the right fix tag. One reason is that the
problem depends on PREEMPT_RT which is enabled in v6.12. Considering the
softirq_expiry_lock lock exists since v5.4 and bpf_timer is introduced
in v5.15, the bpf_timer commit is used in the fixes tag and an extra
depends-on tag is added to state the dependency on PREEMPT_RT.

Fixes: b00628b ("bpf: Introduce bpf timers.")
Depends-on: v6.12+ with PREEMPT_RT enabled
Reported-by: Sebastian Andrzej Siewior <[email protected]>
Closes: https://lore.kernel.org/bpf/[email protected]
Signed-off-by: Hou Tao <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Jan 17, 2025
During the update procedure, when overwrite element in a pre-allocated
htab, the freeing of old_element is protected by the bucket lock. The
reason why the bucket lock is necessary is that the old_element has
already been stashed in htab->extra_elems after alloc_htab_elem()
returns. If freeing the old_element after the bucket lock is unlocked,
the stashed element may be reused by concurrent update procedure and the
freeing of old_element will run concurrently with the reuse of the
old_element. However, the invocation of check_and_free_fields() may
acquire a spin-lock which violates the lockdep rule because its caller
has already held a raw-spin-lock (bucket lock). The following warning
will be reported when such race happens:

  BUG: scheduling while atomic: test_progs/676/0x00000003
  3 locks held by test_progs/676:
  #0: ffffffff864b0240 (rcu_read_lock_trace){....}-{0:0}, at: bpf_prog_test_run_syscall+0x2c0/0x830
  #1: ffff88810e961188 (&htab->lockdep_key){....}-{2:2}, at: htab_map_update_elem+0x306/0x1500
  #2: ffff8881f4eac1b8 (&base->softirq_expiry_lock){....}-{2:2}, at: hrtimer_cancel_wait_running+0xe9/0x1b0
  Modules linked in: bpf_testmod(O)
  Preemption disabled at:
  [<ffffffff817837a3>] htab_map_update_elem+0x293/0x1500
  CPU: 0 UID: 0 PID: 676 Comm: test_progs Tainted: G ... 6.12.0+ #11
  Tainted: [W]=WARN, [O]=OOT_MODULE
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)...
  Call Trace:
  <TASK>
  dump_stack_lvl+0x57/0x70
  dump_stack+0x10/0x20
  __schedule_bug+0x120/0x170
  __schedule+0x300c/0x4800
  schedule_rtlock+0x37/0x60
  rtlock_slowlock_locked+0x6d9/0x54c0
  rt_spin_lock+0x168/0x230
  hrtimer_cancel_wait_running+0xe9/0x1b0
  hrtimer_cancel+0x24/0x30
  bpf_timer_delete_work+0x1d/0x40
  bpf_timer_cancel_and_free+0x5e/0x80
  bpf_obj_free_fields+0x262/0x4a0
  check_and_free_fields+0x1d0/0x280
  htab_map_update_elem+0x7fc/0x1500
  bpf_prog_9f90bc20768e0cb9_overwrite_cb+0x3f/0x43
  bpf_prog_ea601c4649694dbd_overwrite_timer+0x5d/0x7e
  bpf_prog_test_run_syscall+0x322/0x830
  __sys_bpf+0x135d/0x3ca0
  __x64_sys_bpf+0x75/0xb0
  x64_sys_call+0x1b5/0xa10
  do_syscall_64+0x3b/0xc0
  entry_SYSCALL_64_after_hwframe+0x4b/0x53
  ...
  </TASK>

It seems feasible to break the reuse and refill of per-cpu extra_elems
into two independent parts: reuse the per-cpu extra_elems with bucket
lock being held and refill the old_element as per-cpu extra_elems after
the bucket lock is unlocked. However, it will make the concurrent
overwrite procedures on the same CPU return unexpected -E2BIG error when
the map is full.

Therefore, the patch fixes the lock problem by breaking the cancelling
of bpf_timer into two steps for PREEMPT_RT:
1) use hrtimer_try_to_cancel() and check its return value
2) if the timer is running, use hrtimer_cancel() through a kworker to
   cancel it again
Considering that the current implementation of hrtimer_cancel() will try
to acquire a being held softirq_expiry_lock when the current timer is
running, these steps above are reasonable. However, it also has
downside. When the timer is running, the cancelling of the timer is
delayed when releasing the last map uref. The delay is also fixable
(e.g., break the cancelling of bpf timer into two parts: one part in
locked scope, another one in unlocked scope), it can be revised later if
necessary.

It is a bit hard to decide the right fix tag. One reason is that the
problem depends on PREEMPT_RT which is enabled in v6.12. Considering the
softirq_expiry_lock lock exists since v5.4 and bpf_timer is introduced
in v5.15, the bpf_timer commit is used in the fixes tag and an extra
depends-on tag is added to state the dependency on PREEMPT_RT.

Fixes: b00628b ("bpf: Introduce bpf timers.")
Depends-on: v6.12+ with PREEMPT_RT enabled
Reported-by: Sebastian Andrzej Siewior <[email protected]>
Closes: https://lore.kernel.org/bpf/[email protected]
Signed-off-by: Hou Tao <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Jan 17, 2025
During the update procedure, when overwrite element in a pre-allocated
htab, the freeing of old_element is protected by the bucket lock. The
reason why the bucket lock is necessary is that the old_element has
already been stashed in htab->extra_elems after alloc_htab_elem()
returns. If freeing the old_element after the bucket lock is unlocked,
the stashed element may be reused by concurrent update procedure and the
freeing of old_element will run concurrently with the reuse of the
old_element. However, the invocation of check_and_free_fields() may
acquire a spin-lock which violates the lockdep rule because its caller
has already held a raw-spin-lock (bucket lock). The following warning
will be reported when such race happens:

  BUG: scheduling while atomic: test_progs/676/0x00000003
  3 locks held by test_progs/676:
  #0: ffffffff864b0240 (rcu_read_lock_trace){....}-{0:0}, at: bpf_prog_test_run_syscall+0x2c0/0x830
  #1: ffff88810e961188 (&htab->lockdep_key){....}-{2:2}, at: htab_map_update_elem+0x306/0x1500
  #2: ffff8881f4eac1b8 (&base->softirq_expiry_lock){....}-{2:2}, at: hrtimer_cancel_wait_running+0xe9/0x1b0
  Modules linked in: bpf_testmod(O)
  Preemption disabled at:
  [<ffffffff817837a3>] htab_map_update_elem+0x293/0x1500
  CPU: 0 UID: 0 PID: 676 Comm: test_progs Tainted: G ... 6.12.0+ #11
  Tainted: [W]=WARN, [O]=OOT_MODULE
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)...
  Call Trace:
  <TASK>
  dump_stack_lvl+0x57/0x70
  dump_stack+0x10/0x20
  __schedule_bug+0x120/0x170
  __schedule+0x300c/0x4800
  schedule_rtlock+0x37/0x60
  rtlock_slowlock_locked+0x6d9/0x54c0
  rt_spin_lock+0x168/0x230
  hrtimer_cancel_wait_running+0xe9/0x1b0
  hrtimer_cancel+0x24/0x30
  bpf_timer_delete_work+0x1d/0x40
  bpf_timer_cancel_and_free+0x5e/0x80
  bpf_obj_free_fields+0x262/0x4a0
  check_and_free_fields+0x1d0/0x280
  htab_map_update_elem+0x7fc/0x1500
  bpf_prog_9f90bc20768e0cb9_overwrite_cb+0x3f/0x43
  bpf_prog_ea601c4649694dbd_overwrite_timer+0x5d/0x7e
  bpf_prog_test_run_syscall+0x322/0x830
  __sys_bpf+0x135d/0x3ca0
  __x64_sys_bpf+0x75/0xb0
  x64_sys_call+0x1b5/0xa10
  do_syscall_64+0x3b/0xc0
  entry_SYSCALL_64_after_hwframe+0x4b/0x53
  ...
  </TASK>

It seems feasible to break the reuse and refill of per-cpu extra_elems
into two independent parts: reuse the per-cpu extra_elems with bucket
lock being held and refill the old_element as per-cpu extra_elems after
the bucket lock is unlocked. However, it will make the concurrent
overwrite procedures on the same CPU return unexpected -E2BIG error when
the map is full.

Therefore, the patch fixes the lock problem by breaking the cancelling
of bpf_timer into two steps for PREEMPT_RT:
1) use hrtimer_try_to_cancel() and check its return value
2) if the timer is running, use hrtimer_cancel() through a kworker to
   cancel it again
Considering that the current implementation of hrtimer_cancel() will try
to acquire a being held softirq_expiry_lock when the current timer is
running, these steps above are reasonable. However, it also has
downside. When the timer is running, the cancelling of the timer is
delayed when releasing the last map uref. The delay is also fixable
(e.g., break the cancelling of bpf timer into two parts: one part in
locked scope, another one in unlocked scope), it can be revised later if
necessary.

It is a bit hard to decide the right fix tag. One reason is that the
problem depends on PREEMPT_RT which is enabled in v6.12. Considering the
softirq_expiry_lock lock exists since v5.4 and bpf_timer is introduced
in v5.15, the bpf_timer commit is used in the fixes tag and an extra
depends-on tag is added to state the dependency on PREEMPT_RT.

Fixes: b00628b ("bpf: Introduce bpf timers.")
Depends-on: v6.12+ with PREEMPT_RT enabled
Reported-by: Sebastian Andrzej Siewior <[email protected]>
Closes: https://lore.kernel.org/bpf/[email protected]
Signed-off-by: Hou Tao <[email protected]>
Reviewed-by: Toke Høiland-Jørgensen <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Jan 17, 2025
During the update procedure, when overwrite element in a pre-allocated
htab, the freeing of old_element is protected by the bucket lock. The
reason why the bucket lock is necessary is that the old_element has
already been stashed in htab->extra_elems after alloc_htab_elem()
returns. If freeing the old_element after the bucket lock is unlocked,
the stashed element may be reused by concurrent update procedure and the
freeing of old_element will run concurrently with the reuse of the
old_element. However, the invocation of check_and_free_fields() may
acquire a spin-lock which violates the lockdep rule because its caller
has already held a raw-spin-lock (bucket lock). The following warning
will be reported when such race happens:

  BUG: scheduling while atomic: test_progs/676/0x00000003
  3 locks held by test_progs/676:
  #0: ffffffff864b0240 (rcu_read_lock_trace){....}-{0:0}, at: bpf_prog_test_run_syscall+0x2c0/0x830
  #1: ffff88810e961188 (&htab->lockdep_key){....}-{2:2}, at: htab_map_update_elem+0x306/0x1500
  #2: ffff8881f4eac1b8 (&base->softirq_expiry_lock){....}-{2:2}, at: hrtimer_cancel_wait_running+0xe9/0x1b0
  Modules linked in: bpf_testmod(O)
  Preemption disabled at:
  [<ffffffff817837a3>] htab_map_update_elem+0x293/0x1500
  CPU: 0 UID: 0 PID: 676 Comm: test_progs Tainted: G ... 6.12.0+ #11
  Tainted: [W]=WARN, [O]=OOT_MODULE
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)...
  Call Trace:
  <TASK>
  dump_stack_lvl+0x57/0x70
  dump_stack+0x10/0x20
  __schedule_bug+0x120/0x170
  __schedule+0x300c/0x4800
  schedule_rtlock+0x37/0x60
  rtlock_slowlock_locked+0x6d9/0x54c0
  rt_spin_lock+0x168/0x230
  hrtimer_cancel_wait_running+0xe9/0x1b0
  hrtimer_cancel+0x24/0x30
  bpf_timer_delete_work+0x1d/0x40
  bpf_timer_cancel_and_free+0x5e/0x80
  bpf_obj_free_fields+0x262/0x4a0
  check_and_free_fields+0x1d0/0x280
  htab_map_update_elem+0x7fc/0x1500
  bpf_prog_9f90bc20768e0cb9_overwrite_cb+0x3f/0x43
  bpf_prog_ea601c4649694dbd_overwrite_timer+0x5d/0x7e
  bpf_prog_test_run_syscall+0x322/0x830
  __sys_bpf+0x135d/0x3ca0
  __x64_sys_bpf+0x75/0xb0
  x64_sys_call+0x1b5/0xa10
  do_syscall_64+0x3b/0xc0
  entry_SYSCALL_64_after_hwframe+0x4b/0x53
  ...
  </TASK>

It seems feasible to break the reuse and refill of per-cpu extra_elems
into two independent parts: reuse the per-cpu extra_elems with bucket
lock being held and refill the old_element as per-cpu extra_elems after
the bucket lock is unlocked. However, it will make the concurrent
overwrite procedures on the same CPU return unexpected -E2BIG error when
the map is full.

Therefore, the patch fixes the lock problem by breaking the cancelling
of bpf_timer into two steps for PREEMPT_RT:
1) use hrtimer_try_to_cancel() and check its return value
2) if the timer is running, use hrtimer_cancel() through a kworker to
   cancel it again
Considering that the current implementation of hrtimer_cancel() will try
to acquire a being held softirq_expiry_lock when the current timer is
running, these steps above are reasonable. However, it also has
downside. When the timer is running, the cancelling of the timer is
delayed when releasing the last map uref. The delay is also fixable
(e.g., break the cancelling of bpf timer into two parts: one part in
locked scope, another one in unlocked scope), it can be revised later if
necessary.

It is a bit hard to decide the right fix tag. One reason is that the
problem depends on PREEMPT_RT which is enabled in v6.12. Considering the
softirq_expiry_lock lock exists since v5.4 and bpf_timer is introduced
in v5.15, the bpf_timer commit is used in the fixes tag and an extra
depends-on tag is added to state the dependency on PREEMPT_RT.

Fixes: b00628b ("bpf: Introduce bpf timers.")
Depends-on: v6.12+ with PREEMPT_RT enabled
Reported-by: Sebastian Andrzej Siewior <[email protected]>
Closes: https://lore.kernel.org/bpf/[email protected]
Signed-off-by: Hou Tao <[email protected]>
Reviewed-by: Toke Høiland-Jørgensen <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Jan 17, 2025
Under PREEMPT_RT, it is not safe to use GPF_ATOMIC kmalloc when
preemption or irq is disabled. The following warning is reported when
running test_progs under PREEMPT_RT:

  BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
  in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 675, name: test_progs
  preempt_count: 1, expected: 0
  RCU nest depth: 0, expected: 0
  2 locks held by test_progs/675:
   #0: ffffffff864b0240 (rcu_read_lock_trace){....}-{0:0}, at: bpf_prog_test_run_syscall+0x2c0/0x830
   #1: ffff8881f4ec40c8 ((&c->lock)){....}-{2:2}, at: ___slab_alloc+0xbc/0x1280
  Preemption disabled at:
  [<ffffffff8175ae2b>] __bpf_async_init+0xbb/0xb10
  CPU: 1 UID: 0 PID: 675 Comm: test_progs Tainted: G           O       6.12.0+ #11
  Tainted: [O]=OOT_MODULE
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ...
  Call Trace:
   <TASK>
   dump_stack_lvl+0x57/0x70
   dump_stack+0x10/0x20
   __might_resched+0x337/0x4d0
   rt_spin_lock+0xd4/0x230
   ___slab_alloc+0xbc/0x1280
   __slab_alloc.isra.0+0x5d/0xa0
   __kmalloc_node_noprof+0xf7/0x4f0
   bpf_map_kmalloc_node+0xf5/0x6b0
   __bpf_async_init+0x20e/0xb10
   bpf_timer_init+0x30/0x40
   bpf_prog_c7e2dc9ff3d5ba62_start_cb+0x55/0x85
   bpf_prog_4eb421be69ae82fa_start_timer+0x5d/0x7e
   bpf_prog_test_run_syscall+0x322/0x830
   __sys_bpf+0x135d/0x3ca0
   __x64_sys_bpf+0x75/0xb0
   x64_sys_call+0x1b5/0xa10
   do_syscall_64+0x3b/0xc0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53

Fix the problem by using bpf_global_ma to allocate bpf_async_cb when
PREEMPT_RT is enabled. The reason for still using kmalloc for
no-PREEMPT_RT case is that bpf_global_ma doesn't support accouting the
allocated memory to specific memcg. Also doing the memory allocation
before invoking __bpf_spin_lock_irqsave() to reduce the possibility of
-ENOMEM for bpf_global_ma.

Signed-off-by: Hou Tao <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Jan 17, 2025
During the update procedure, when overwrite element in a pre-allocated
htab, the freeing of old_element is protected by the bucket lock. The
reason why the bucket lock is necessary is that the old_element has
already been stashed in htab->extra_elems after alloc_htab_elem()
returns. If freeing the old_element after the bucket lock is unlocked,
the stashed element may be reused by concurrent update procedure and the
freeing of old_element will run concurrently with the reuse of the
old_element. However, the invocation of check_and_free_fields() may
acquire a spin-lock which violates the lockdep rule because its caller
has already held a raw-spin-lock (bucket lock). The following warning
will be reported when such race happens:

  BUG: scheduling while atomic: test_progs/676/0x00000003
  3 locks held by test_progs/676:
  #0: ffffffff864b0240 (rcu_read_lock_trace){....}-{0:0}, at: bpf_prog_test_run_syscall+0x2c0/0x830
  #1: ffff88810e961188 (&htab->lockdep_key){....}-{2:2}, at: htab_map_update_elem+0x306/0x1500
  #2: ffff8881f4eac1b8 (&base->softirq_expiry_lock){....}-{2:2}, at: hrtimer_cancel_wait_running+0xe9/0x1b0
  Modules linked in: bpf_testmod(O)
  Preemption disabled at:
  [<ffffffff817837a3>] htab_map_update_elem+0x293/0x1500
  CPU: 0 UID: 0 PID: 676 Comm: test_progs Tainted: G ... 6.12.0+ #11
  Tainted: [W]=WARN, [O]=OOT_MODULE
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)...
  Call Trace:
  <TASK>
  dump_stack_lvl+0x57/0x70
  dump_stack+0x10/0x20
  __schedule_bug+0x120/0x170
  __schedule+0x300c/0x4800
  schedule_rtlock+0x37/0x60
  rtlock_slowlock_locked+0x6d9/0x54c0
  rt_spin_lock+0x168/0x230
  hrtimer_cancel_wait_running+0xe9/0x1b0
  hrtimer_cancel+0x24/0x30
  bpf_timer_delete_work+0x1d/0x40
  bpf_timer_cancel_and_free+0x5e/0x80
  bpf_obj_free_fields+0x262/0x4a0
  check_and_free_fields+0x1d0/0x280
  htab_map_update_elem+0x7fc/0x1500
  bpf_prog_9f90bc20768e0cb9_overwrite_cb+0x3f/0x43
  bpf_prog_ea601c4649694dbd_overwrite_timer+0x5d/0x7e
  bpf_prog_test_run_syscall+0x322/0x830
  __sys_bpf+0x135d/0x3ca0
  __x64_sys_bpf+0x75/0xb0
  x64_sys_call+0x1b5/0xa10
  do_syscall_64+0x3b/0xc0
  entry_SYSCALL_64_after_hwframe+0x4b/0x53
  ...
  </TASK>

It seems feasible to break the reuse and refill of per-cpu extra_elems
into two independent parts: reuse the per-cpu extra_elems with bucket
lock being held and refill the old_element as per-cpu extra_elems after
the bucket lock is unlocked. However, it will make the concurrent
overwrite procedures on the same CPU return unexpected -E2BIG error when
the map is full.

Therefore, the patch fixes the lock problem by breaking the cancelling
of bpf_timer into two steps for PREEMPT_RT:
1) use hrtimer_try_to_cancel() and check its return value
2) if the timer is running, use hrtimer_cancel() through a kworker to
   cancel it again
Considering that the current implementation of hrtimer_cancel() will try
to acquire a being held softirq_expiry_lock when the current timer is
running, these steps above are reasonable. However, it also has
downside. When the timer is running, the cancelling of the timer is
delayed when releasing the last map uref. The delay is also fixable
(e.g., break the cancelling of bpf timer into two parts: one part in
locked scope, another one in unlocked scope), it can be revised later if
necessary.

It is a bit hard to decide the right fix tag. One reason is that the
problem depends on PREEMPT_RT which is enabled in v6.12. Considering the
softirq_expiry_lock lock exists since v5.4 and bpf_timer is introduced
in v5.15, the bpf_timer commit is used in the fixes tag and an extra
depends-on tag is added to state the dependency on PREEMPT_RT.

Fixes: b00628b ("bpf: Introduce bpf timers.")
Depends-on: v6.12+ with PREEMPT_RT enabled
Reported-by: Sebastian Andrzej Siewior <[email protected]>
Closes: https://lore.kernel.org/bpf/[email protected]
Signed-off-by: Hou Tao <[email protected]>
Reviewed-by: Toke Høiland-Jørgensen <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Jan 20, 2025
During the update procedure, when overwrite element in a pre-allocated
htab, the freeing of old_element is protected by the bucket lock. The
reason why the bucket lock is necessary is that the old_element has
already been stashed in htab->extra_elems after alloc_htab_elem()
returns. If freeing the old_element after the bucket lock is unlocked,
the stashed element may be reused by concurrent update procedure and the
freeing of old_element will run concurrently with the reuse of the
old_element. However, the invocation of check_and_free_fields() may
acquire a spin-lock which violates the lockdep rule because its caller
has already held a raw-spin-lock (bucket lock). The following warning
will be reported when such race happens:

  BUG: scheduling while atomic: test_progs/676/0x00000003
  3 locks held by test_progs/676:
  #0: ffffffff864b0240 (rcu_read_lock_trace){....}-{0:0}, at: bpf_prog_test_run_syscall+0x2c0/0x830
  #1: ffff88810e961188 (&htab->lockdep_key){....}-{2:2}, at: htab_map_update_elem+0x306/0x1500
  #2: ffff8881f4eac1b8 (&base->softirq_expiry_lock){....}-{2:2}, at: hrtimer_cancel_wait_running+0xe9/0x1b0
  Modules linked in: bpf_testmod(O)
  Preemption disabled at:
  [<ffffffff817837a3>] htab_map_update_elem+0x293/0x1500
  CPU: 0 UID: 0 PID: 676 Comm: test_progs Tainted: G ... 6.12.0+ #11
  Tainted: [W]=WARN, [O]=OOT_MODULE
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)...
  Call Trace:
  <TASK>
  dump_stack_lvl+0x57/0x70
  dump_stack+0x10/0x20
  __schedule_bug+0x120/0x170
  __schedule+0x300c/0x4800
  schedule_rtlock+0x37/0x60
  rtlock_slowlock_locked+0x6d9/0x54c0
  rt_spin_lock+0x168/0x230
  hrtimer_cancel_wait_running+0xe9/0x1b0
  hrtimer_cancel+0x24/0x30
  bpf_timer_delete_work+0x1d/0x40
  bpf_timer_cancel_and_free+0x5e/0x80
  bpf_obj_free_fields+0x262/0x4a0
  check_and_free_fields+0x1d0/0x280
  htab_map_update_elem+0x7fc/0x1500
  bpf_prog_9f90bc20768e0cb9_overwrite_cb+0x3f/0x43
  bpf_prog_ea601c4649694dbd_overwrite_timer+0x5d/0x7e
  bpf_prog_test_run_syscall+0x322/0x830
  __sys_bpf+0x135d/0x3ca0
  __x64_sys_bpf+0x75/0xb0
  x64_sys_call+0x1b5/0xa10
  do_syscall_64+0x3b/0xc0
  entry_SYSCALL_64_after_hwframe+0x4b/0x53
  ...
  </TASK>

It seems feasible to break the reuse and refill of per-cpu extra_elems
into two independent parts: reuse the per-cpu extra_elems with bucket
lock being held and refill the old_element as per-cpu extra_elems after
the bucket lock is unlocked. However, it will make the concurrent
overwrite procedures on the same CPU return unexpected -E2BIG error when
the map is full.

Therefore, the patch fixes the lock problem by breaking the cancelling
of bpf_timer into two steps for PREEMPT_RT:
1) use hrtimer_try_to_cancel() and check its return value
2) if the timer is running, use hrtimer_cancel() through a kworker to
   cancel it again
Considering that the current implementation of hrtimer_cancel() will try
to acquire a being held softirq_expiry_lock when the current timer is
running, these steps above are reasonable. However, it also has
downside. When the timer is running, the cancelling of the timer is
delayed when releasing the last map uref. The delay is also fixable
(e.g., break the cancelling of bpf timer into two parts: one part in
locked scope, another one in unlocked scope), it can be revised later if
necessary.

It is a bit hard to decide the right fix tag. One reason is that the
problem depends on PREEMPT_RT which is enabled in v6.12. Considering the
softirq_expiry_lock lock exists since v5.4 and bpf_timer is introduced
in v5.15, the bpf_timer commit is used in the fixes tag and an extra
depends-on tag is added to state the dependency on PREEMPT_RT.

Fixes: b00628b ("bpf: Introduce bpf timers.")
Depends-on: v6.12+ with PREEMPT_RT enabled
Reported-by: Sebastian Andrzej Siewior <[email protected]>
Closes: https://lore.kernel.org/bpf/[email protected]
Signed-off-by: Hou Tao <[email protected]>
Reviewed-by: Toke Høiland-Jørgensen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Feb 3, 2025
libtraceevent parses and returns an array of argument fields, sometimes
larger than RAW_SYSCALL_ARGS_NUM (6) because it includes "__syscall_nr",
idx will traverse to index 6 (7th element) whereas sc->fmt->arg holds 6
elements max, creating an out-of-bounds access. This runtime error is
found by UBsan. The error message:

  $ sudo UBSAN_OPTIONS=print_stacktrace=1 ./perf trace -a --max-events=1
  builtin-trace.c:1966:35: runtime error: index 6 out of bounds for type 'syscall_arg_fmt [6]'
    #0 0x5c04956be5fe in syscall__alloc_arg_fmts /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:1966
    #1 0x5c04956c0510 in trace__read_syscall_info /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:2110
    #2 0x5c04956c372b in trace__syscall_info /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:2436
    #3 0x5c04956d2f39 in trace__init_syscalls_bpf_prog_array_maps /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:3897
    #4 0x5c04956d6d25 in trace__run /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:4335
    #5 0x5c04956e112e in cmd_trace /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:5502
    #6 0x5c04956eda7d in run_builtin /home/howard/hw/linux-perf/tools/perf/perf.c:351
    #7 0x5c04956ee0a8 in handle_internal_command /home/howard/hw/linux-perf/tools/perf/perf.c:404
    #8 0x5c04956ee37f in run_argv /home/howard/hw/linux-perf/tools/perf/perf.c:448
    #9 0x5c04956ee8e9 in main /home/howard/hw/linux-perf/tools/perf/perf.c:556
    #10 0x79eb3622a3b7 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
    #11 0x79eb3622a47a in __libc_start_main_impl ../csu/libc-start.c:360
    #12 0x5c04955422d4 in _start (/home/howard/hw/linux-perf/tools/perf/perf+0x4e02d4) (BuildId: 5b6cab2d59e96a4341741765ad6914a4d784dbc6)

     0.000 ( 0.014 ms): Chrome_ChildIO/117244 write(fd: 238, buf: !, count: 1)                                      = 1

Fixes: 5e58fcf ("perf trace: Allow allocating sc->arg_fmt even without the syscall tracepoint")
Signed-off-by: Howard Chu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Namhyung Kim <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Feb 4, 2025
On an aarch64 kernel with CONFIG_PAGE_SIZE_64KB=y (64k pages),
arena_htab tests cause a segmentation fault and soft lockup.

$ sudo ./test_progs -t arena_htab
Caught signal #11!
Stack trace:
./test_progs(crash_handler+0x1c)[0x7bd4d8]
linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffb34a0968]
./test_progs[0x420f74]
./test_progs(htab_lookup_elem+0x3c)[0x421090]
./test_progs[0x421320]
./test_progs[0x421bb8]
./test_progs(test_arena_htab+0x40)[0x421c14]
./test_progs[0x7bda84]
./test_progs(main+0x65c)[0x7bf670]
/usr/lib64/libc.so.6(+0x2caa0)[0xffffb31ecaa0]
/usr/lib64/libc.so.6(__libc_start_main+0x98)[0xffffb31ecb78]
./test_progs(_start+0x30)[0x41b4f0]

Message from syslogd@bpfol9aarch64 at Feb  4 08:50:09 ...
 kernel:watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [kworker/u8:4:7589]

The same failure is not observed with 4k pages on aarch64.

Investigating further, it turns out arena_map_free() was calling
apply_to_existing_page_range() with the address returned by
bpf_arena_get_kern_vm_start().  If this address is not page-aligned -
as is the case for a 64k page kernel - we wind up calling apply_to_pte_range()
with that unaligned address.  The problem is apply_to_pte_range() implicitly
assumes that the addr passed in is page-aligned, specifically in this loop:

		do {
                        if (create || !pte_none(ptep_get(pte))) {
                                err = fn(pte++, addr, data);
                                if (err)
                                        break;
                        }
                } while (addr += PAGE_SIZE, addr != end);

If addr is _not_ page-aligned, it will never equal end exactly.

One solution is to round up the address returned by bpf_arena_get_kern_vm_start()
to a page-aligned value.  With that change in place the test passes:

$ sudo ./test_progs -t arena_htab
Summary: 1/1 PASSED, 1 SKIPPED, 0 FAILED

Reported-by: Colm Harrington <[email protected]>
Signed-off-by: Alan Maguire <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Feb 5, 2025
On an aarch64 kernel with CONFIG_PAGE_SIZE_64KB=y (64k pages),
arena_htab tests cause a segmentation fault and soft lockup.

$ sudo ./test_progs -t arena_htab
Caught signal #11!
Stack trace:
./test_progs(crash_handler+0x1c)[0x7bd4d8]
linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffb34a0968]
./test_progs[0x420f74]
./test_progs(htab_lookup_elem+0x3c)[0x421090]
./test_progs[0x421320]
./test_progs[0x421bb8]
./test_progs(test_arena_htab+0x40)[0x421c14]
./test_progs[0x7bda84]
./test_progs(main+0x65c)[0x7bf670]
/usr/lib64/libc.so.6(+0x2caa0)[0xffffb31ecaa0]
/usr/lib64/libc.so.6(__libc_start_main+0x98)[0xffffb31ecb78]
./test_progs(_start+0x30)[0x41b4f0]

Message from syslogd@bpfol9aarch64 at Feb  4 08:50:09 ...
 kernel:watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [kworker/u8:4:7589]

The same failure is not observed with 4k pages on aarch64.

Investigating further, it turns out arena_map_free() was calling
apply_to_existing_page_range() with the address returned by
bpf_arena_get_kern_vm_start().  If this address is not page-aligned -
as is the case for a 64k page kernel - we wind up calling apply_to_pte_range()
with that unaligned address.  The problem is apply_to_pte_range() implicitly
assumes that the addr passed in is page-aligned, specifically in this loop:

		do {
                        if (create || !pte_none(ptep_get(pte))) {
                                err = fn(pte++, addr, data);
                                if (err)
                                        break;
                        }
                } while (addr += PAGE_SIZE, addr != end);

If addr is _not_ page-aligned, it will never equal end exactly.

One solution is to round up GUARD_SZ to PAGE_SIZE << 1 so that the
division by 2 in bpf_arena_get_kern_vm_start() returns a page-aligned
value.  With that change in place, the test passes:

$ sudo ./test_progs -t arena_htab
Summary: 1/1 PASSED, 1 SKIPPED, 0 FAILED

Fixes: 3174603 ("bpf: Introduce bpf_arena.")
Reported-by: Colm Harrington <[email protected]>
Suggested-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Alan Maguire <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Feb 5, 2025
On an aarch64 kernel with CONFIG_PAGE_SIZE_64KB=y (64k pages),
arena_htab tests cause a segmentation fault and soft lockup.

$ sudo ./test_progs -t arena_htab
Caught signal #11!
Stack trace:
./test_progs(crash_handler+0x1c)[0x7bd4d8]
linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffb34a0968]
./test_progs[0x420f74]
./test_progs(htab_lookup_elem+0x3c)[0x421090]
./test_progs[0x421320]
./test_progs[0x421bb8]
./test_progs(test_arena_htab+0x40)[0x421c14]
./test_progs[0x7bda84]
./test_progs(main+0x65c)[0x7bf670]
/usr/lib64/libc.so.6(+0x2caa0)[0xffffb31ecaa0]
/usr/lib64/libc.so.6(__libc_start_main+0x98)[0xffffb31ecb78]
./test_progs(_start+0x30)[0x41b4f0]

Message from syslogd@bpfol9aarch64 at Feb  4 08:50:09 ...
 kernel:watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [kworker/u8:4:7589]

The same failure is not observed with 4k pages on aarch64.

Investigating further, it turns out arena_map_free() was calling
apply_to_existing_page_range() with the address returned by
bpf_arena_get_kern_vm_start().  If this address is not page-aligned -
as is the case for a 64k page kernel - we wind up calling apply_to_pte_range()
with that unaligned address.  The problem is apply_to_pte_range() implicitly
assumes that the addr passed in is page-aligned, specifically in this loop:

		do {
                        if (create || !pte_none(ptep_get(pte))) {
                                err = fn(pte++, addr, data);
                                if (err)
                                        break;
                        }
                } while (addr += PAGE_SIZE, addr != end);

If addr is _not_ page-aligned, it will never equal end exactly.

One solution is to round up GUARD_SZ to PAGE_SIZE << 1 so that the
division by 2 in bpf_arena_get_kern_vm_start() returns a page-aligned
value.  With that change in place, the test passes:

$ sudo ./test_progs -t arena_htab
Summary: 1/1 PASSED, 1 SKIPPED, 0 FAILED

Fixes: 3174603 ("bpf: Introduce bpf_arena.")
Reported-by: Colm Harrington <[email protected]>
Suggested-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Alan Maguire <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Feb 5, 2025
On an aarch64 kernel with CONFIG_PAGE_SIZE_64KB=y (64k pages),
arena_htab tests cause a segmentation fault and soft lockup.

$ sudo ./test_progs -t arena_htab
Caught signal #11!
Stack trace:
./test_progs(crash_handler+0x1c)[0x7bd4d8]
linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffb34a0968]
./test_progs[0x420f74]
./test_progs(htab_lookup_elem+0x3c)[0x421090]
./test_progs[0x421320]
./test_progs[0x421bb8]
./test_progs(test_arena_htab+0x40)[0x421c14]
./test_progs[0x7bda84]
./test_progs(main+0x65c)[0x7bf670]
/usr/lib64/libc.so.6(+0x2caa0)[0xffffb31ecaa0]
/usr/lib64/libc.so.6(__libc_start_main+0x98)[0xffffb31ecb78]
./test_progs(_start+0x30)[0x41b4f0]

Message from syslogd@bpfol9aarch64 at Feb  4 08:50:09 ...
 kernel:watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [kworker/u8:4:7589]

The same failure is not observed with 4k pages on aarch64.

Investigating further, it turns out arena_map_free() was calling
apply_to_existing_page_range() with the address returned by
bpf_arena_get_kern_vm_start().  If this address is not page-aligned -
as is the case for a 64k page kernel - we wind up calling apply_to_pte_range()
with that unaligned address.  The problem is apply_to_pte_range() implicitly
assumes that the addr passed in is page-aligned, specifically in this loop:

		do {
                        if (create || !pte_none(ptep_get(pte))) {
                                err = fn(pte++, addr, data);
                                if (err)
                                        break;
                        }
                } while (addr += PAGE_SIZE, addr != end);

If addr is _not_ page-aligned, it will never equal end exactly.

One solution is to round up GUARD_SZ to PAGE_SIZE << 1 so that the
division by 2 in bpf_arena_get_kern_vm_start() returns a page-aligned
value.  With that change in place, the test passes:

$ sudo ./test_progs -t arena_htab
Summary: 1/1 PASSED, 1 SKIPPED, 0 FAILED

Fixes: 3174603 ("bpf: Introduce bpf_arena.")
Reported-by: Colm Harrington <[email protected]>
Suggested-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Alan Maguire <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Mar 21, 2025
Chia-Yu Chang says:

====================
AccECN protocol preparation patch series

Please find the v7

v7 (03-Mar-2025)
- Move 2 new patches added in v6 to the next AccECN patch series

v6 (27-Dec-2024)
- Avoid removing removing the potential CA_ACK_WIN_UPDATE in ack_ev_flags of patch #1 (Eric Dumazet <[email protected]>)
- Add reviewed-by tag in patches #2, #3, #4, #5, #6, #7, #8, #12, #14
- Foloiwng 2 new pathces are added after patch #9 (Patch that adds SKB_GSO_TCP_ACCECN)
  * New patch #10 to replace exisiting SKB_GSO_TCP_ECN with SKB_GSO_TCP_ACCECN in the driver to avoid CWR flag corruption
  * New patch #11 adds AccECN for virtio by adding new negotiation flag (VIRTIO_NET_F_HOST/GUEST_ACCECN) in feature handshake and translating Accurate ECN GSO flag between virtio_net_hdr (VIRTIO_NET_HDR_GSO_ACCECN) and skb header (SKB_GSO_TCP_ACCECN)
- Add detailed changelog and comments in #13 (Eric Dumazet <[email protected]>)
- Move patch #14 to the next AccECN patch series (Eric Dumazet <[email protected]>)

v5 (5-Nov-2024)
- Add helper function "tcp_flags_ntohs" to preserve last 2 bytes of TCP flags of patch #4 (Paolo Abeni <[email protected]>)
- Fix reverse X-max tree order of patches #4, #11 (Paolo Abeni <[email protected]>)
- Rename variable "delta" as "timestamp_delta" of patch #2 fo clariety
- Remove patch #14 in this series (Paolo Abeni <[email protected]>, Joel Granados <[email protected]>)

v4 (21-Oct-2024)
- Fix line length warning of patches #2, #4, #8, #10, #11, #14
- Fix spaces preferred around '|' (ctx:VxV) warning of patch #7
- Add missing CC'ed of patches #4, #12, #14

v3 (19-Oct-2024)
- Fix build error in v2

v2 (18-Oct-2024)
- Fix warning caused by NETIF_F_GSO_ACCECN_BIT in patch #9 (Jakub Kicinski <[email protected]>)

The full patch series can be found in
https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/

The Accurate ECN draft can be found in
https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28
====================

Signed-off-by: David S. Miller <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Apr 1, 2025
perf test 11 hwmon fails on s390 with this error

 # ./perf test -Fv 11
 --- start ---
 ---- end ----
 11.1: Basic parsing test             : Ok
 --- start ---
 Testing 'temp_test_hwmon_event1'
 Using CPUID IBM,3931,704,A01,3.7,002f
 temp_test_hwmon_event1 -> hwmon_a_test_hwmon_pmu/temp_test_hwmon_event1/
 FAILED tests/hwmon_pmu.c:189 Unexpected config for
    'temp_test_hwmon_event1', 292470092988416 != 655361
 ---- end ----
 11.2: Parsing without PMU name       : FAILED!
 --- start ---
 Testing 'hwmon_a_test_hwmon_pmu/temp_test_hwmon_event1/'
 FAILED tests/hwmon_pmu.c:189 Unexpected config for
    'hwmon_a_test_hwmon_pmu/temp_test_hwmon_event1/',
    292470092988416 != 655361
 ---- end ----
 11.3: Parsing with PMU name          : FAILED!
 #

The root cause is in member test_event::config which is initialized
to 0xA0001 or 655361. During event parsing a long list event parsing
functions are called and end up with this gdb call stack:

 #0  hwmon_pmu__config_term (hwm=0x168dfd0, attr=0x3ffffff5ee8,
	term=0x168db60, err=0x3ffffff81c8) at util/hwmon_pmu.c:623
 #1  hwmon_pmu__config_terms (pmu=0x168dfd0, attr=0x3ffffff5ee8,
	terms=0x3ffffff5ea8, err=0x3ffffff81c8) at util/hwmon_pmu.c:662
 #2  0x00000000012f870c in perf_pmu__config_terms (pmu=0x168dfd0,
	attr=0x3ffffff5ee8, terms=0x3ffffff5ea8, zero=false,
	apply_hardcoded=false, err=0x3ffffff81c8) at util/pmu.c:1519
 #3  0x00000000012f88a4 in perf_pmu__config (pmu=0x168dfd0, attr=0x3ffffff5ee8,
	head_terms=0x3ffffff5ea8, apply_hardcoded=false, err=0x3ffffff81c8)
	at util/pmu.c:1545
 #4  0x00000000012680c4 in parse_events_add_pmu (parse_state=0x3ffffff7fb8,
	list=0x168dc00, pmu=0x168dfd0, const_parsed_terms=0x3ffffff6090,
	auto_merge_stats=true, alternate_hw_config=10)
	at util/parse-events.c:1508
 #5  0x00000000012684c6 in parse_events_multi_pmu_add (parse_state=0x3ffffff7fb8,
	event_name=0x168ec10 "temp_test_hwmon_event1", hw_config=10,
	const_parsed_terms=0x0, listp=0x3ffffff6230, loc_=0x3ffffff70e0)
	at util/parse-events.c:1592
 #6  0x00000000012f0e4e in parse_events_parse (_parse_state=0x3ffffff7fb8,
	scanner=0x16878c0) at util/parse-events.y:293
 #7  0x00000000012695a0 in parse_events__scanner (str=0x3ffffff81d8
	"temp_test_hwmon_event1", input=0x0, parse_state=0x3ffffff7fb8)
	at util/parse-events.c:1867
 #8  0x000000000126a1e8 in __parse_events (evlist=0x168b580,
	str=0x3ffffff81d8 "temp_test_hwmon_event1", pmu_filter=0x0,
	err=0x3ffffff81c8, fake_pmu=false, warn_if_reordered=true,
	fake_tp=false) at util/parse-events.c:2136
 #9  0x00000000011e36aa in parse_events (evlist=0x168b580,
	str=0x3ffffff81d8 "temp_test_hwmon_event1", err=0x3ffffff81c8)
	at /root/linux/tools/perf/util/parse-events.h:41
 #10 0x00000000011e3e64 in do_test (i=0, with_pmu=false, with_alias=false)
	at tests/hwmon_pmu.c:164
 #11 0x00000000011e422c in test__hwmon_pmu (with_pmu=false)
	at tests/hwmon_pmu.c:219
 #12 0x00000000011e431c in test__hwmon_pmu_without_pmu (test=0x1610368
	<suite.hwmon_pmu>, subtest=1) at tests/hwmon_pmu.c:23

where the attr::config is set to value 292470092988416 or 0x10a0000000000
in line 625 of file ./util/hwmon_pmu.c:

   attr->config = key.type_and_num;

However member key::type_and_num is defined as union and bit field:

   union hwmon_pmu_event_key {
        long type_and_num;
        struct {
                int num :16;
                enum hwmon_type type :8;
        };
   };

s390 is big endian and Intel is little endian architecture.
The events for the hwmon dummy pmu have num = 1 or num = 2 and
type is set to HWMON_TYPE_TEMP (which is 10).
On s390 this assignes member key::type_and_num the value of
0x10a0000000000 (which is 292470092988416) as shown in above
trace output.

Fix this and export the structure/union hwmon_pmu_event_key
so the test shares the same implementation as the event parsing
functions for union and bit fields. This should avoid
endianess issues on all platforms.

Output after:
 # ./perf test -F 11
 11.1: Basic parsing test         : Ok
 11.2: Parsing without PMU name   : Ok
 11.3: Parsing with PMU name      : Ok
 #

Fixes: 531ee0f ("perf test: Add hwmon "PMU" test")
Signed-off-by: Thomas Richter <[email protected]>
Reviewed-by: Ian Rogers <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Namhyung Kim <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Apr 1, 2025
Ian told me that there are many memory leaks in the hierarchy mode.  I
can easily reproduce it with the follwing command.

  $ make DEBUG=1 EXTRA_CFLAGS=-fsanitize=leak

  $ perf record --latency -g -- ./perf test -w thloop

  $ perf report -H --stdio
  ...
  Indirect leak of 168 byte(s) in 21 object(s) allocated from:
      #0 0x7f3414c16c65 in malloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:75
      #1 0x55ed3602346e in map__get util/map.h:189
      #2 0x55ed36024cc4 in hist_entry__init util/hist.c:476
      #3 0x55ed36025208 in hist_entry__new util/hist.c:588
      #4 0x55ed36027c05 in hierarchy_insert_entry util/hist.c:1587
      #5 0x55ed36027e2e in hists__hierarchy_insert_entry util/hist.c:1638
      #6 0x55ed36027fa4 in hists__collapse_insert_entry util/hist.c:1685
      #7 0x55ed360283e8 in hists__collapse_resort util/hist.c:1776
      #8 0x55ed35de0323 in report__collapse_hists /home/namhyung/project/linux/tools/perf/builtin-report.c:735
      #9 0x55ed35de15b4 in __cmd_report /home/namhyung/project/linux/tools/perf/builtin-report.c:1119
      #10 0x55ed35de43dc in cmd_report /home/namhyung/project/linux/tools/perf/builtin-report.c:1867
      #11 0x55ed35e66767 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:351
      #12 0x55ed35e66a0e in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:404
      #13 0x55ed35e66b67 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:448
      #14 0x55ed35e66eb0 in main /home/namhyung/project/linux/tools/perf/perf.c:556
      #15 0x7f340ac33d67 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
  ...

  $ perf report -H --stdio 2>&1 | grep -c '^Indirect leak'
  93

I found that hist_entry__delete() missed to release child entries in the
hierarchy tree (hroot_{in,out}).  It needs to iterate the child entries
and call hist_entry__delete() recursively.

After this change:

  $ perf report -H --stdio 2>&1 | grep -c '^Indirect leak'
  0

Reported-by: Ian Rogers <[email protected]>
Tested-by Thomas Falcon <[email protected]>
Reviewed-by: Ian Rogers <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Namhyung Kim <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Apr 1, 2025
The env.pmu_mapping can be leaked when it reads data from a pipe on AMD.
For a pipe data, it reads the header data including pmu_mapping from
PERF_RECORD_HEADER_FEATURE runtime.  But it's already set in:

  perf_session__new()
    __perf_session__new()
      evlist__init_trace_event_sample_raw()
        evlist__has_amd_ibs()
          perf_env__nr_pmu_mappings()

Then it'll overwrite that when it processes the HEADER_FEATURE record.
Here's a report from address sanitizer.

  Direct leak of 2689 byte(s) in 1 object(s) allocated from:
    #0 0x7fed8f814596 in realloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:98
    #1 0x5595a7d416b1 in strbuf_grow util/strbuf.c:64
    #2 0x5595a7d414ef in strbuf_init util/strbuf.c:25
    #3 0x5595a7d0f4b7 in perf_env__read_pmu_mappings util/env.c:362
    #4 0x5595a7d12ab7 in perf_env__nr_pmu_mappings util/env.c:517
    #5 0x5595a7d89d2f in evlist__has_amd_ibs util/amd-sample-raw.c:315
    #6 0x5595a7d87fb2 in evlist__init_trace_event_sample_raw util/sample-raw.c:23
    #7 0x5595a7d7f893 in __perf_session__new util/session.c:179
    #8 0x5595a7b79572 in perf_session__new util/session.h:115
    #9 0x5595a7b7e9dc in cmd_report builtin-report.c:1603
    #10 0x5595a7c019eb in run_builtin perf.c:351
    #11 0x5595a7c01c92 in handle_internal_command perf.c:404
    #12 0x5595a7c01deb in run_argv perf.c:448
    #13 0x5595a7c02134 in main perf.c:556
    #14 0x7fed85833d67 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

Let's free the existing pmu_mapping data if any.

Cc: Ravi Bangoria <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Namhyung Kim <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Apr 1, 2025
…ge_order()

Patch series "mm: MM owner tracking for large folios (!hugetlb) +
CONFIG_NO_PAGE_MAPCOUNT", v3.

Let's add an "easy" way to decide -- without false positives, without
page-mapcounts and without page table/rmap scanning -- whether a large
folio is "certainly mapped exclusively" into a single MM, or whether it
"maybe mapped shared" into multiple MMs.

Use that information to implement Copy-on-Write reuse, to convert
folio_likely_mapped_shared() to folio_maybe_mapped_share(), and to
introduce a kernel config option that lets us not use+maintain per-page
mapcounts in large folios anymore.

The bigger picture was presented at LSF/MM [1].

This series is effectively a follow-up on my early work [2], which
implemented a more precise, but also more complicated, way to identify
whether a large folio is "mapped shared" into multiple MMs or "mapped
exclusively" into a single MM.


1 Patch Organization
====================

Patch #1 -> #6: make more room in order-1 folios, so we have two
                "unsigned long" available for our purposes

Patch #7 -> #11: preparations

Patch #12: MM owner tracking for large folios

Patch #13: COW reuse for PTE-mapped anon THP

Patch #14: folio_maybe_mapped_shared()

Patch #15 -> #20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT


2 MM owner tracking
===================

We assign each MM a unique ID ("MM ID"), to be able to squeeze more
information in our folios.  On 32bit we use 15-bit IDs, on 64bit we use
31-bit IDs.

For each large folios, we now store two MM-ID+mapcount ("slot")
combinations:
* mm0_id + mm0_mapcount
* mm1_id + mm1_mapcount

On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit
mapcount.  This way, we require 2x "unsigned long" on 32bit and 64bit for
both slots.

Paired with the large mapcount, we can reliably identify whether one of
these MMs is the current owner (-> owns all mappings) or even holds all
folio references (-> owns all mappings, and all references are from
mappings).

As long as only two MMs map folio pages at a time, we can reliably and
precisely identify whether a large folio is "mapped shared" or "mapped
exclusively".

Any additional MM that starts mapping the folio while there are no free
slots becomes an "untracked MM".  If one such "untracked MM" is the last
one mapping a folio exclusively, we will not detect the folio as "mapped
exclusively" but instead as "maybe mapped shared".  (exception: only a
single mapping remains)

So that's where the approach gets imprecise.

For now, we use a bit-spinlock to sync the large mapcount + slots, and
make sure we do keep the machinery fast, to not degrade (un)map
performance drastically: for example, we make sure to only use a single
atomic (when grabbing the bit-spinlock), like we would already perform
when updating the large mapcount.


3 CONFIG_NO_PAGE_MAPCOUNT
=========================

patch #15 -> #20 spell out and document what exactly is affected when not
maintaining the per-page mapcounts in large folios anymore.

Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore
when (un)mapping pages, we'll account a complete folio as mapped if a
single page is mapped.  In addition, we'll not detect partially mapped
anonymous folios as such in all cases yet.

Likely less relevant changes include that we might now under-estimate the
USS (Unique Set Size) of a process, but never over-estimate it.

The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point, to
then slowly make it the only option, as we learn about real-life impacts
and possible ways to mitigate them.


4 Performance
=============

Detailed performance numbers were included in v1 [3], and not that much
changed between v1 and v2.

I did plenty of measurements on different systems in the meantime, that
all revealed slightly different results.

The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code
layout changes on some systems.  Especially the fork() benchmark started
being more-shaky-than-before on recent kernels for some reason.

In summary, with my micro-benchmarks:

* Small folios are not impacted.

* CoW performance seems to be mostly unchanged across all folios sizes.

* CoW reuse performance of large folios now matches CoW reuse
  performance of small folios, because we now actually implement the CoW
  reuse optimization.  On an Intel Xeon Silver 4210R I measured a ~65%
  reduction in runtime, on an arm64 system I measured ~54% reduction.

* munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT.  I saw
  double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and
  up to ~70% on an AmpereOne A192-32X) with larger folios.  The larger the
  folios, the larger the performance improvement.

* munmao() performance very slightly (couple percent) degrades without
  CONFIG_NO_PAGE_MAPCOUNT for smaller folios.  For larger folios, there
  seems to be no change at all.

* fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT.  I saw
  double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and
  up to ~10% on an AmpereOne A192-32X) with larger folios.  The larger the
  folios, the larger the performance improvement.

* While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be
  almost unchanged on some systems, I saw some degradation for smaller
  folios on the AmpereOne A192-32X.  I did not investigate the details
  yet, but I suspect code layout changes or suboptimal code placement /
  inlining.

I'm not to worried about the fork() micro-benchmarks for smaller folios
given how shaky the results are lately and by how much we improved fork()
performance recently.

I also ran case-anon-cow-rand and case-anon-cow-seq part of
vm-scalability, to assess the scalability and the impact of the
bit-spinlock.  My measurements on a two 2-socket 10-core Intel Xeon Silver
4210R CPU revealed no significant changes.

Similarly, running these benchmarks with 2 MiB THPs enabled on the
AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev,
which is nice.

So far, I did not get my hands on a similarly large system with multiple
sockets.

I found no other fitting scalability benchmarks that seem to really hammer
on concurrent mapping/unmapping of large folio pages like
case-anon-cow-seq does.


5 Concerns
==========

5.1 Bit spinlock
----------------

I'm not quite happy about the bit-spinlock, but so far it does not seem to
affect scalability in my measurements.

If it ever becomes a problem we could either investigate improving the
locking, or simply stopping the MM tracking once there are "too many
mappings" and simply assume that the folio is "mapped shared" until it was
freed.

This would be similar (but slightly different) to the "0,1,2,stopped"
counting idea Willy had at some point.  Adding that logic to "stop
tracking" adds more code to the hot path, so I avoided that for now.


5.2 folio_maybe_mapped_shared()
-------------------------------

I documented the change from folio_likely_mapped_shared() to
folio_maybe_mapped_shared() quite extensively.  If we run into surprises,
I have some ideas on how to resolve them.  For now, I think we should be
fine.


5.3 Added code to map/unmap hot path
------------------------------------

So far, it looks like the added code on the rmap hot path does not really
seem to matter much in the bigger picture.  I'd like to further reduce it
(and possibly improve fork() performance further), but I don't easily see
how right now.  Well, and I am out of puff 🙂

Having that said, alternatives I considered (e.g., per-MM per-folio
mapcount) would add a lot more overhead to these hot paths.


6 Future Work
=============

6.1 Large mapcount
------------------

It would be very handy if the large mapcount would count how often folio
pages are actually mapped into page tables: a PMD on x86-64 would count
512 times.  Calculating the average per-page mapcount will be easy, and
remapping (PMD->PTE) folios would get even faster.

That would also remove the need for the entire mapcount (except for
PMD-sized folios for memory statistics reasons ...), and allow for mapping
folios larger than PMDs (e.g., 4 MiB) easily.

We likely would also have to take the same number of folio references to
make our folio_mapcount() == folio_ref_count() work, and we'd want to be
able to avoid mapcount+refcount overflows: this could already become an
issue with pte-mapped PUD-sized folios (fsdax).

One approach we discussed in the THP cabal meeting is (1) extending the
mapcount for large folios to 64bit (at least on 64bit systems) and (2)
keeping the refcount at 32bit, but (3) having exactly one reference if the
the mapcount != 0.

It should be doable, but there are some corner cases to consider on the
unmap path; it is something that I will be looking into next.


6.2 hugetlb
-----------

I'd love to make use of the same tracking also for hugetlb.

The real problem is PMD table sharing: getting a page mapped by MM X and
unmapped by MM Y will not work.  With mshare, that problem should not
exist (all mapping/unmapping will be routed through the mshare MM).

[1] https://lwn.net/Articles/974223/
[2] https://lore.kernel.org/linux-mm/[email protected]/T/
[3] https://lkml.kernel.org/r/[email protected]
[4] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c


This patch (of 20):

Let's factor it out into a simple helper function.  This helper will also
come in handy when working with code where we know that our folio is
large.

Maybe in the future we'll have the order readily available for small and
large folios; in that case, folio_large_order() would simply translate to
folio_order().

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Lance Yang <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Andy Lutomirks^H^Hski <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcow (Oracle) <[email protected]>
Cc: Michal Koutn <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: tejun heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Apr 3, 2025
When a bio with REQ_PREFLUSH is submitted to dm, __send_empty_flush()
generates a flush_bio with REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC,
which causes the flush_bio to be throttled by wbt_wait().

An example from v5.4, similar problem also exists in upstream:

    crash> bt 2091206
    PID: 2091206  TASK: ffff2050df92a300  CPU: 109  COMMAND: "kworker/u260:0"
     #0 [ffff800084a2f7f0] __switch_to at ffff80004008aeb8
     #1 [ffff800084a2f820] __schedule at ffff800040bfa0c4
     #2 [ffff800084a2f880] schedule at ffff800040bfa4b4
     #3 [ffff800084a2f8a0] io_schedule at ffff800040bfa9c4
     #4 [ffff800084a2f8c0] rq_qos_wait at ffff8000405925bc
     #5 [ffff800084a2f940] wbt_wait at ffff8000405bb3a0
     #6 [ffff800084a2f9a0] __rq_qos_throttle at ffff800040592254
     #7 [ffff800084a2f9c0] blk_mq_make_request at ffff80004057cf38
     #8 [ffff800084a2fa60] generic_make_request at ffff800040570138
     #9 [ffff800084a2fae0] submit_bio at ffff8000405703b4
    #10 [ffff800084a2fb50] xlog_write_iclog at ffff800001280834 [xfs]
    #11 [ffff800084a2fbb0] xlog_sync at ffff800001280c3c [xfs]
    #12 [ffff800084a2fbf0] xlog_state_release_iclog at ffff800001280df4 [xfs]
    #13 [ffff800084a2fc10] xlog_write at ffff80000128203c [xfs]
    #14 [ffff800084a2fcd0] xlog_cil_push at ffff8000012846dc [xfs]
    #15 [ffff800084a2fda0] xlog_cil_push_work at ffff800001284a2c [xfs]
    #16 [ffff800084a2fdb0] process_one_work at ffff800040111d08
    #17 [ffff800084a2fe00] worker_thread at ffff8000401121cc
    #18 [ffff800084a2fe70] kthread at ffff800040118de4

After commit 2def284 ("xfs: don't allow log IO to be throttled"),
the metadata submitted by xlog_write_iclog() should not be throttled.
But due to the existence of the dm layer, throttling flush_bio indirectly
causes the metadata bio to be throttled.

Fix this by conditionally adding REQ_IDLE to flush_bio.bi_opf, which makes
wbt_should_throttle() return false to avoid wbt_wait().

Signed-off-by: Jinliang Zheng <[email protected]>
Reviewed-by: Tianxiang Peng <[email protected]>
Reviewed-by: Hao Peng <[email protected]>
Signed-off-by: Mikulas Patocka <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Apr 23, 2025
Ido Schimmel says:

====================
vxlan: Convert FDB table to rhashtable

The VXLAN driver currently stores FDB entries in a hash table with a
fixed number of buckets (256), resulting in reduced performance as the
number of entries grows. This patchset solves the issue by converting
the driver to use rhashtable which maintains a more or less constant
performance regardless of the number of entries.

Measured transmitted packets per second using a single pktgen thread
with varying number of entries when the transmitted packet always hits
the default entry (worst case):

Number of entries | Improvement
------------------|------------
1k                | +1.12%
4k                | +9.22%
16k               | +55%
64k               | +585%
256k              | +2460%

The first patches are preparations for the conversion in the last patch.
Specifically, the series is structured as follows:

Patch #1 adds RCU read-side critical sections in the Tx path when
accessing FDB entries. Targeting at net-next as I am not aware of any
issues due to this omission despite the code being structured that way
for a long time. Without it, traces will be generated when converting
FDB lookup to rhashtable_lookup().

Patch #2-#5 simplify the creation of the default FDB entry (all-zeroes).
Current code assumes that insertion into the hash table cannot fail,
which will no longer be true with rhashtable.

Patches #6-#10 add FDB entries to a linked list for entry traversal
instead of traversing over them using the fixed size hash table which is
removed in the last patch.

Patches #11-#12 add wrappers for FDB lookup that make it clear when each
should be used along with lockdep annotations. Needed as a preparation
for rhashtable_lookup() that must be called from an RCU read-side
critical section.

Patch #13 treats dst cache initialization errors as non-fatal. See more
info in the commit message. The current code happens to work because
insertion into the fixed size hash table is slow enough for the per-CPU
allocator to be able to create new chunks of per-CPU memory.

Patch #14 adds an FDB key structure that includes the MAC address and
source VNI. To be used as rhashtable key.

Patch #15 does the conversion to rhashtable.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants