Data Acknowledgement if single node is used #1

Closed
brodnev opened this issue Jun 21, 2018 · 1 comment

brodnev commented Jun 21, 2018

MPTCP uses Data Acknowledgements in order to retransmit data if one of the nodes fails permanently. However, if there is only a single node between an MPTCP-capable sender and an MPTCP-capable receiver, is Data Acknowledgement still in operation?

@mjmartineau
Member

@brodnev, for protocol questions like this I recommend either the multipath-tcp.org mailing list (https://listes-2.sipr.ucl.ac.be/sympa/info/mptcp-dev) for questions about the current Linux MPTCP implementation, or the Linux MPTCP upstreaming list at https://lists.01.org/mailman/listinfo/mptcp regarding the work-in-progress implementation in this github repo.

mjmartineau pushed a commit that referenced this issue Aug 10, 2018
Running the following:

 # cd /sys/kernel/debug/tracing
 # echo 500000 > buffer_size_kb
[ Or some other number that takes up most of memory ]
 # echo snapshot > events/sched/sched_switch/trigger

Triggers the following bug:

 ------------[ cut here ]------------
 kernel BUG at mm/slub.c:296!
 invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
 CPU: 6 PID: 6878 Comm: bash Not tainted 4.18.0-rc6-test+ #1066
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
 RIP: 0010:kfree+0x16c/0x180
 Code: 05 41 0f b6 72 51 5b 5d 41 5c 4c 89 d7 e9 ac b3 f8 ff 48 89 d9 48 89 da 41 b8 01 00 00 00 5b 5d 41 5c 4c 89 d6 e9 f4 f3 ff ff <0f> 0b 0f 0b 48 8b 3d d9 d8 f9 00 e9 c1 fe ff ff 0f 1f 40 00 0f 1f
 RSP: 0018:ffffb654436d3d88 EFLAGS: 00010246
 RAX: ffff91a9d50f3d80 RBX: ffff91a9d50f3d80 RCX: ffff91a9d50f3d80
 RDX: 00000000000006a4 RSI: ffff91a9de5a60e0 RDI: ffff91a9d9803500
 RBP: ffffffff8d267c80 R08: 00000000000260e0 R09: ffffffff8c1a56be
 R10: fffff0d404543cc0 R11: 0000000000000389 R12: ffffffff8c1a56be
 R13: ffff91a9d9930e18 R14: ffff91a98c0c2890 R15: ffffffff8d267d00
 FS:  00007f363ea64700(0000) GS:ffff91a9de580000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000055c1cacc8e10 CR3: 00000000d9b46003 CR4: 00000000001606e0
 Call Trace:
  event_trigger_callback+0xee/0x1d0
  event_trigger_write+0xfc/0x1a0
  __vfs_write+0x33/0x190
  ? handle_mm_fault+0x115/0x230
  ? _cond_resched+0x16/0x40
  vfs_write+0xb0/0x190
  ksys_write+0x52/0xc0
  do_syscall_64+0x5a/0x160
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
 RIP: 0033:0x7f363e16ab50
 Code: 73 01 c3 48 8b 0d 38 83 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 79 db 2c 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 1e e3 01 00 48 89 04 24
 RSP: 002b:00007fff9a4c6378 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
 RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 00007f363e16ab50
 RDX: 0000000000000009 RSI: 000055c1cacc8e10 RDI: 0000000000000001
 RBP: 000055c1cacc8e10 R08: 00007f363e435740 R09: 00007f363ea64700
 R10: 0000000000000073 R11: 0000000000000246 R12: 0000000000000009
 R13: 0000000000000001 R14: 00007f363e4345e0 R15: 00007f363e4303c0
 Modules linked in: ip6table_filter ip6_tables snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_seq snd_seq_device i915 snd_pcm snd_timer i2c_i801 snd soundcore i2c_algo_bit drm_kms_helper
86_pkg_temp_thermal video kvm_intel kvm irqbypass wmi e1000e
 ---[ end trace d301afa879ddfa25 ]---

The cause is that the register_snapshot_trigger() call failed to
allocate the snapshot buffer, and then called unregister_trigger(),
which freed the data that was passed to it. Then on return to the
function that called register_snapshot_trigger(), as it sees that it
failed to register, it frees the trigger_data again and causes
a double free.

By calling event_trigger_init() on the trigger_data (which only ups
the reference counter for it), and then event_trigger_free() afterward,
the trigger_data would not get freed by the registering trigger function
as it would only up and lower the ref count for it. If the register
trigger function fails, then the event_trigger_free() called after it
will free the trigger data normally.
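
The reference-counting pattern described above can be sketched in plain
user-space C. This is only a model under stated assumptions: struct
trigger_data, trigger_get(), trigger_put(), register_trigger() and
install_and_remove() are hypothetical stand-ins, not the kernel's
functions, and the "frees" counter exists only so the behaviour can be
observed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the trigger data; not the kernel structure. */
struct trigger_data {
	int ref;
};

static int frees;	/* how many times an object was really freed */

/* Take a reference (the role the message gives event_trigger_init()). */
static void trigger_get(struct trigger_data *d)
{
	d->ref++;
}

/* Drop a reference; free on the last one (role of event_trigger_free()). */
static void trigger_put(struct trigger_data *d)
{
	assert(d->ref > 0);
	if (--d->ref == 0) {
		free(d);
		frees++;
	}
}

/* Model of a register function whose error path drops its own reference. */
static int register_trigger(struct trigger_data *d, int fail)
{
	trigger_get(d);
	if (fail) {
		trigger_put(d);	/* no longer frees: the caller holds a ref */
		return -1;
	}
	return 0;
}

/* Caller holds its own reference across registration; returns free count. */
static int install_and_remove(int fail)
{
	struct trigger_data *d = calloc(1, sizeof(*d));
	int ret;

	if (!d)
		return -1;
	frees = 0;
	trigger_get(d);			/* caller's reference */
	ret = register_trigger(d, fail);
	trigger_put(d);			/* frees only if registration failed */
	if (ret == 0)
		trigger_put(d);		/* later unregistration drops the last ref */
	return frees;
}
```

In both the success and the failure case the object is freed exactly
once, which is the property the fix is after.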

Link: http://lkml.kernel.org/r/[email protected]

Cc: [email protected]
Fixes: 93e31ff ("tracing: Add 'snapshot' event trigger command")
Reported-by: Masami Hiramatsu <[email protected]>
Reviewed-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
mjmartineau pushed a commit that referenced this issue Aug 10, 2018
The number of eRPs that can be used by a single A-TCAM region is limited
to 16. When more eRPs are needed, an ordinary circuit TCAM (C-TCAM) can
be used to hold the extra eRPs.

Unlike the A-TCAM, only a single (last) lookup is performed in the
C-TCAM and not a lookup per-eRP. However, modeling the C-TCAM as extra
eRPs will allow us to easily introduce support for pruning in a
follow-up patch set and is also logically correct.

The following diagram depicts the relation between both TCAMs:
                                                                 C-TCAM
+-------------------+               +--------------------+    +-----------+
|                   |               |                    |    |           |
|  eRP #1 (A-TCAM)  +----> ... +----+  eRP #16 (A-TCAM)  +----+  eRP #17  |
|                   |               |                    |    |    ...    |
+-------------------+               +--------------------+    |  eRP #N   |
                                                              |           |
                                                              +-----------+
Lookup order is from left to right.

Extend the eRP core APIs with a C-TCAM parameter which indicates whether
the requested eRP is to be used with the C-TCAM or not.

Since the C-TCAM is only meant to absorb rules that can't fit in the
A-TCAM due to exceeded number of eRPs or key collision, an error is
returned when a C-TCAM eRP needs to be created when the eRP state
machine is in its initial state (i.e., 'no masks'). This should only
happen in the face of very unlikely errors when trying to push rules
into the A-TCAM.

In order not to perform unnecessary lookups, the eRP core will only
enable a C-TCAM lookup for a given region if it knows there are C-TCAM
eRPs present.
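
A rough model of the rules above, as a user-space C sketch. All names
(struct erp_region, erp_create(), erp_ctcam_lookup_enabled()) and both
error codes are assumptions made for illustration; only the limit of 16
A-TCAM eRPs, the error in the 'no masks' state and the conditional
lookup come from the description.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define ATCAM_MAX_ERPS 16	/* per-region limit quoted in the message */

/* Hypothetical per-region bookkeeping; not the driver's real structure. */
struct erp_region {
	unsigned int atcam_erps;
	unsigned int ctcam_erps;
};

/* Create an eRP; 'ctcam' selects the C-TCAM. Returns 0 or a negative errno. */
static int erp_create(struct erp_region *r, bool ctcam)
{
	assert(r != NULL);
	if (ctcam) {
		/* Initial 'no masks' state: nothing could have spilled over. */
		if (r->atcam_erps == 0 && r->ctcam_erps == 0)
			return -EINVAL;	/* error code is an assumption */
		r->ctcam_erps++;
		return 0;
	}
	if (r->atcam_erps >= ATCAM_MAX_ERPS)
		return -ENOSPC;	/* assumed: caller may fall back to C-TCAM */
	r->atcam_erps++;
	return 0;
}

/* The C-TCAM lookup is enabled only when the region holds C-TCAM eRPs. */
static bool erp_ctcam_lookup_enabled(const struct erp_region *r)
{
	assert(r != NULL);
	return r->ctcam_erps > 0;
}
```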

Signed-off-by: Ido Schimmel <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
mjmartineau pushed a commit that referenced this issue Aug 10, 2018
Registration of a memory region (MR) through FRMR/fastreg (unlike FMR)
needs a connection/qp. With a proxy qp, this dependency on connection
will be removed, but that needs more infrastructure patches, which is a
work in progress.

As an intermediate fix, the get_mr returns EOPNOTSUPP when connection
details are not populated. The MR registration through sendmsg() will
continue to work even with fast registration, since connection in this
case is formed upfront.

This patch fixes the following crash:
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Modules linked in:
CPU: 1 PID: 4244 Comm: syzkaller468044 Not tainted 4.16.0-rc6+ #361
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:rds_ib_get_mr+0x5c/0x230 net/rds/ib_rdma.c:544
RSP: 0018:ffff8801b059f890 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: ffff8801b07e1300 RCX: ffffffff8562d96e
RDX: 000000000000000d RSI: 0000000000000001 RDI: 0000000000000068
RBP: ffff8801b059f8b8 R08: ffffed0036274244 R09: ffff8801b13a1200
R10: 0000000000000004 R11: ffffed0036274243 R12: ffff8801b13a1200
R13: 0000000000000001 R14: ffff8801ca09fa9c R15: 0000000000000000
FS:  00007f4d050af700(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f4d050aee78 CR3: 00000001b0d9b006 CR4: 00000000001606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 __rds_rdma_map+0x710/0x1050 net/rds/rdma.c:271
 rds_get_mr_for_dest+0x1d4/0x2c0 net/rds/rdma.c:357
 rds_setsockopt+0x6cc/0x980 net/rds/af_rds.c:347
 SYSC_setsockopt net/socket.c:1849 [inline]
 SyS_setsockopt+0x189/0x360 net/socket.c:1828
 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x4456d9
RSP: 002b:00007f4d050aedb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 00000000006dac3c RCX: 00000000004456d9
RDX: 0000000000000007 RSI: 0000000000000114 RDI: 0000000000000004
RBP: 00000000006dac38 R08: 00000000000000a0 R09: 0000000000000000
R10: 0000000020000380 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fffbfb36d6f R14: 00007f4d050af9c0 R15: 0000000000000005
Code: fa 48 c1 ea 03 80 3c 02 00 0f 85 cc 01 00 00 4c 8b bb 80 04 00 00 48 b8 00 00 00 00 00 fc ff df 49 8d 7f 68 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 9c 01 00 00 4d 8b 7f 68 48 b8 00 00 00 00 00
RIP: rds_ib_get_mr+0x5c/0x230 net/rds/ib_rdma.c:544 RSP: ffff8801b059f890
---[ end trace 7e1cea13b85473b0 ]---

Reported-by: [email protected]
Signed-off-by: Santosh Shilimkar <[email protected]>
Signed-off-by: Avinash Repaka <[email protected]>

Signed-off-by: David S. Miller <[email protected]>
mjmartineau pushed a commit that referenced this issue Aug 10, 2018
…ilure

While forking, if delayacct init fails due to memory shortage, it
continues expecting all delayacct users to check task->delays pointer
against NULL before dereferencing it, which all of them used to do.

Commit c96f547 ("delayacct: Account blkio completion on the correct
task"), while updating delayacct_blkio_end() to take the target task
instead of always using %current, made the function test NULL on
%current->delays and then continue to operate on @p->delays.  If
%current succeeded init while @p didn't, it leads to the following
crash.

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
 IP: __delayacct_blkio_end+0xc/0x40
 PGD 8000001fd07e1067 P4D 8000001fd07e1067 PUD 1fcffbb067 PMD 0
 Oops: 0000 [#1] SMP PTI
 CPU: 4 PID: 25774 Comm: QIOThread0 Not tainted 4.16.0-9_fbk1_rc2_1180_g6b593215b4d7 #9
 RIP: 0010:__delayacct_blkio_end+0xc/0x40
 Call Trace:
  try_to_wake_up+0x2c0/0x600
  autoremove_wake_function+0xe/0x30
  __wake_up_common+0x74/0x120
  wake_up_page_bit+0x9c/0xe0
  mpage_end_io+0x27/0x70
  blk_update_request+0x78/0x2c0
  scsi_end_request+0x2c/0x1e0
  scsi_io_completion+0x20b/0x5f0
  blk_mq_complete_request+0xa2/0x100
  ata_scsi_qc_complete+0x79/0x400
  ata_qc_complete_multiple+0x86/0xd0
  ahci_handle_port_interrupt+0xc9/0x5c0
  ahci_handle_port_intr+0x54/0xb0
  ahci_single_level_irq_intr+0x3b/0x60
  __handle_irq_event_percpu+0x43/0x190
  handle_irq_event_percpu+0x20/0x50
  handle_irq_event+0x2a/0x50
  handle_edge_irq+0x80/0x1c0
  handle_irq+0xaf/0x120
  do_IRQ+0x41/0xc0
  common_interrupt+0xf/0xf

Fix it by updating delayacct_blkio_end() to check @p->delays instead.
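
The shape of the fix can be shown with a small user-space sketch. The
types and the blkio_end() helper below are hypothetical simplifications,
not the kernel's definitions; the point is only that the NULL test and
the dereference use the same task.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct task_delay_info {
	int blkio_count;
};

struct task {
	struct task_delay_info *delays;	/* NULL if delayacct init failed */
};

/*
 * Account a completed block I/O against task p.
 * Returns 1 if it was accounted, 0 if p has no delay accounting.
 */
static int blkio_end(struct task *p)
{
	assert(p != NULL);
	if (!p->delays)		/* check the task that is dereferenced below */
		return 0;
	p->delays->blkio_count++;
	return 1;
}
```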

Link: http://lkml.kernel.org/r/[email protected]
Fixes: c96f547 ("delayacct: Account blkio completion on the correct task")
Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Dave Jones <[email protected]>
Debugged-by: Dave Jones <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Josh Snyder <[email protected]>
Cc: <[email protected]>	[4.15+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
mjmartineau pushed a commit that referenced this issue Aug 10, 2018
vma_is_anonymous() relies on ->vm_ops being NULL to detect an anonymous
VMA.  This is unreliable as ->mmap may not set ->vm_ops.

False-positive vma_is_anonymous() may lead to crashes:

	next ffff8801ce5e7040 prev ffff8801d20eca50 mm ffff88019c1e13c0
	prot 27 anon_vma ffff88019680cdd8 vm_ops 0000000000000000
	pgoff 0 file ffff8801b2ec2d00 private_data 0000000000000000
	flags: 0xff(read|write|exec|shared|mayread|maywrite|mayexec|mayshare)
	------------[ cut here ]------------
	kernel BUG at mm/memory.c:1422!
	invalid opcode: 0000 [#1] SMP KASAN
	CPU: 0 PID: 18486 Comm: syz-executor3 Not tainted 4.18.0-rc3+ #136
	Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
	RIP: 0010:zap_pmd_range mm/memory.c:1421 [inline]
	RIP: 0010:zap_pud_range mm/memory.c:1466 [inline]
	RIP: 0010:zap_p4d_range mm/memory.c:1487 [inline]
	RIP: 0010:unmap_page_range+0x1c18/0x2220 mm/memory.c:1508
	Call Trace:
	 unmap_single_vma+0x1a0/0x310 mm/memory.c:1553
	 zap_page_range_single+0x3cc/0x580 mm/memory.c:1644
	 unmap_mapping_range_vma mm/memory.c:2792 [inline]
	 unmap_mapping_range_tree mm/memory.c:2813 [inline]
	 unmap_mapping_pages+0x3a7/0x5b0 mm/memory.c:2845
	 unmap_mapping_range+0x48/0x60 mm/memory.c:2880
	 truncate_pagecache+0x54/0x90 mm/truncate.c:800
	 truncate_setsize+0x70/0xb0 mm/truncate.c:826
	 simple_setattr+0xe9/0x110 fs/libfs.c:409
	 notify_change+0xf13/0x10f0 fs/attr.c:335
	 do_truncate+0x1ac/0x2b0 fs/open.c:63
	 do_sys_ftruncate+0x492/0x560 fs/open.c:205
	 __do_sys_ftruncate fs/open.c:215 [inline]
	 __se_sys_ftruncate fs/open.c:213 [inline]
	 __x64_sys_ftruncate+0x59/0x80 fs/open.c:213
	 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
	 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Reproducer:

	#include <stdio.h>
	#include <stddef.h>
	#include <stdint.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/types.h>
	#include <sys/stat.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <unistd.h>
	#include <fcntl.h>

	#define KCOV_INIT_TRACE			_IOR('c', 1, unsigned long)
	#define KCOV_ENABLE			_IO('c', 100)
	#define KCOV_DISABLE			_IO('c', 101)
	#define COVER_SIZE			(1024<<10)

	#define KCOV_TRACE_PC  0
	#define KCOV_TRACE_CMP 1

	int main(int argc, char **argv)
	{
		int fd;
		unsigned long *cover;

		system("mount -t debugfs none /sys/kernel/debug");
		fd = open("/sys/kernel/debug/kcov", O_RDWR);
		ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE);
		cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
				PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		munmap(cover, COVER_SIZE * sizeof(unsigned long));
		cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
				PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
		memset(cover, 0, COVER_SIZE * sizeof(unsigned long));
		ftruncate(fd, 3UL << 20);
		return 0;
	}

This can be fixed by assigning anonymous VMAs their own vm_ops and not
relying on it being NULL.

If ->mmap() failed to set ->vm_ops, mmap_region() will set it to
dummy_vm_ops.  This way we will have non-NULL ->vm_ops for all VMAs.
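
A minimal user-space sketch of the idea as worded in this message,
assuming hypothetical simplified types. The names anon_ops, dummy_ops,
vma_set_anon(), vma_is_anon() and map_region() are illustrative only and
are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct vm_ops {
	const char *name;
};

struct vma {
	const struct vm_ops *ops;
};

static const struct vm_ops anon_ops = { "anon" };	/* marks anonymous VMAs */
static const struct vm_ops dummy_ops = { "dummy" };	/* fallback for drivers */

static void vma_set_anon(struct vma *v)
{
	assert(v != NULL);
	v->ops = &anon_ops;
}

/* Anonymous means "explicitly marked", not "ops happens to be NULL". */
static bool vma_is_anon(const struct vma *v)
{
	assert(v != NULL);
	return v->ops == &anon_ops;
}

/* Model of mapping setup; driver_ops is what the ->mmap() callback set. */
static void map_region(struct vma *v, const struct vm_ops *driver_ops,
		       bool anonymous)
{
	if (anonymous) {
		vma_set_anon(v);
		return;
	}
	/* A driver that left ops unset gets the dummy ops, never NULL. */
	v->ops = driver_ops ? driver_ops : &dummy_ops;
}
```

With this arrangement a file-backed mapping whose driver forgot to set
its ops can no longer be mistaken for an anonymous one.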

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reported-by: [email protected]
Acked-by: Linus Torvalds <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
mjmartineau pushed a commit that referenced this issue Aug 10, 2018
Ido Schimmel says:

====================
mlxsw: Support DSCP prioritization and rewrite

Petr says:

On ingress, a network device such as a switch assigns to packets
priority based on various criteria. Common options include interpreting
PCP and DSCP fields according to user configuration. When a packet
egresses the switch, a reverse process may rewrite PCP and/or DSCP
headers according to packet priority.

So far, mlxsw has supported prioritization based on PCP (802.1p priority
tag). This patch set introduces support for prioritization based on
DSCP, and DSCP rewrite.

To configure the DSCP-to-priority maps, the user is expected to invoke
ieee_setapp and ieee_delapp DCBNL ops, e.g. by using lldptool:

To decide whether or not to pay attention to DSCP values, the Spectrum
switch recognizes a per-port configuration of trust level. Until the
first APP rule is added for a given port, this port's trust level stays
at PCP, meaning that PCP is used for packet prioritization. With the
first DSCP APP rule, the port is configured to trust DSCP instead, and
it stays there until all DSCP APP rules are removed again.
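
The trust-level rule above amounts to a small state derived from the
number of DSCP APP rules on the port. A sketch in user-space C, with
hypothetical names (struct port_app_state, dscp_rule_add(),
dscp_rule_del(), port_trust_level()) that are not taken from the driver:

```c
#include <assert.h>

enum trust_level { TRUST_PCP, TRUST_DSCP };

/* Hypothetical per-port state: only the DSCP APP rule count matters here. */
struct port_app_state {
	unsigned int dscp_rules;
};

/* Trust DSCP while at least one DSCP APP rule exists, PCP otherwise. */
static enum trust_level port_trust_level(const struct port_app_state *p)
{
	return p->dscp_rules > 0 ? TRUST_DSCP : TRUST_PCP;
}

static void dscp_rule_add(struct port_app_state *p)
{
	p->dscp_rules++;
}

static void dscp_rule_del(struct port_app_state *p)
{
	assert(p->dscp_rules > 0);
	p->dscp_rules--;
}
```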

Besides the DSCP (value 5) selector, another selector that plays into
packet prioritization is Ethernet type (value 1) with PID of 0. Such APP
entries denote default priority[1]:

With this patch set, mlxsw uses these values to configure priority for
DSCP values not explicitly specified in DSCP APP map. In the future we
expect to also use this to configure default port priority for untagged
packets.

Access to DSCP-to-priority map, priority-to-DSCP map, and default
priority for a port is exposed through three new DCB helpers. Like the
already-existing dcb_ieee_getapp_mask() helper, these helpers operate in
terms of bitmaps, to support the arbitrary M:N mapping that the APP
rules allow. Such interface presents all the relevant information from
the APP database without necessitating exposition of iterators, locking
or other complex primitives. It is up to the driver to then digest the
mapping in a way that the device supports. In this patch set, mlxsw
resolves conflicts by favoring higher-numbered DSCP values and
priorities.
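
One way to picture the bitmap interface and the "favor higher-numbered"
conflict resolution is the user-space sketch below. The function names
and bitmap widths are assumptions chosen for illustration (8 priorities,
64 DSCP values); they are not the new DCB helpers themselves.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the highest set bit in mask, or -1 when mask is empty. */
static int highest_set_bit(uint64_t mask)
{
	int bit = -1;

	while (mask) {
		mask >>= 1;
		bit++;
	}
	return bit;
}

/*
 * prio_mask: bit N set means an APP rule maps this DSCP to priority N.
 * With several candidates the highest priority wins; with none, the
 * port's default priority is used.
 */
static int dscp_to_prio(uint8_t prio_mask, int default_prio)
{
	int prio = highest_set_bit(prio_mask);

	assert(default_prio >= 0 && default_prio <= 7);
	return prio >= 0 ? prio : default_prio;
}

/*
 * dscp_mask: bit N set means an APP rule maps DSCP N to this priority.
 * Returns the highest such DSCP, or -1 if no rule exists.
 */
static int prio_to_dscp(uint64_t dscp_mask)
{
	return highest_set_bit(dscp_mask);
}
```

For example, a DSCP value mapped to both priority 1 and priority 3
(prio_mask 0x0a) resolves to priority 3.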

In this patchset:

- Patch #1 fixes a bug in DCB APP database management.
- Patch #2 adds the getters described above.
- Patches #3-#6 add Spectrum configuration registers.
- Patch #7 adds the mlxsw logic that configures the device according to
  APP rules.
- Patch #8 adds a self-test. The test is added to the subdirectory
  drivers/net/mlxsw. Even though it's not particularly specific to
  mlxsw, it's not suitable for running on soft devices (which don't
  support the ieee_getapp et al.), and thus isn't a good fit for the
  general net/forwarding directory.

[1] 802.1Q-2014, Table D-9
====================

Signed-off-by: David S. Miller <[email protected]>
mjmartineau pushed a commit that referenced this issue Aug 10, 2018
bpf_parse_prog() is protected by rcu_read_lock(), so GFP_KERNEL
allocations are not allowed in bpf_parse_prog().

[51015.579396] =============================
[51015.579418] WARNING: suspicious RCU usage
[51015.579444] 4.18.0-rc6+ #208 Not tainted
[51015.579464] -----------------------------
[51015.579488] ./include/linux/rcupdate.h:303 Illegal context switch in RCU read-side critical section!
[51015.579510] other info that might help us debug this:
[51015.579532] rcu_scheduler_active = 2, debug_locks = 1
[51015.579556] 2 locks held by ip/1861:
[51015.579577]  #0: 00000000a8c12fd1 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x2e0/0x910
[51015.579711]  #1: 00000000bf815f8e (rcu_read_lock){....}, at: lwtunnel_build_state+0x96/0x390
[51015.579842] stack backtrace:
[51015.579869] CPU: 0 PID: 1861 Comm: ip Not tainted 4.18.0-rc6+ #208
[51015.579891] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
[51015.579911] Call Trace:
[51015.579950]  dump_stack+0x74/0xbb
[51015.580000]  ___might_sleep+0x16b/0x3a0
[51015.580047]  __kmalloc_track_caller+0x220/0x380
[51015.580077]  kmemdup+0x1c/0x40
[51015.580077]  bpf_parse_prog+0x10e/0x230
[51015.580164]  ? kasan_kmalloc+0xa0/0xd0
[51015.580164]  ? bpf_destroy_state+0x30/0x30
[51015.580164]  ? bpf_build_state+0xe2/0x3e0
[51015.580164]  bpf_build_state+0x1bb/0x3e0
[51015.580164]  ? bpf_parse_prog+0x230/0x230
[51015.580164]  ? lock_is_held_type+0x123/0x1a0
[51015.580164]  lwtunnel_build_state+0x1aa/0x390
[51015.580164]  fib_create_info+0x1579/0x33d0
[51015.580164]  ? sched_clock_local+0xe2/0x150
[51015.580164]  ? fib_info_update_nh_saddr+0x1f0/0x1f0
[51015.580164]  ? sched_clock_local+0xe2/0x150
[51015.580164]  fib_table_insert+0x201/0x1990
[51015.580164]  ? lock_downgrade+0x610/0x610
[51015.580164]  ? fib_table_lookup+0x1920/0x1920
[51015.580164]  ? lwtunnel_valid_encap_type.part.6+0xcb/0x3a0
[51015.580164]  ? rtm_to_fib_config+0x637/0xbd0
[51015.580164]  inet_rtm_newroute+0xed/0x1b0
[51015.580164]  ? rtm_to_fib_config+0xbd0/0xbd0
[51015.580164]  rtnetlink_rcv_msg+0x331/0x910
[ ... ]

Fixes: 3a0af8f ("bpf: BPF for lightweight tunnel infrastructure")
Signed-off-by: Taehee Yoo <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
mjmartineau pushed a commit that referenced this issue Aug 10, 2018
The kernel panics under high memory pressure; the call trace looks like this:

PID: 21439 TASK: ffff881be3afedd0 CPU: 16 COMMAND: "java"
 #0 [ffff881ec7ed7630] machine_kexec at ffffffff81059beb
 #1 [ffff881ec7ed7690] __crash_kexec at ffffffff81105942
 #2 [ffff881ec7ed7760] crash_kexec at ffffffff81105a30
 #3 [ffff881ec7ed7778] oops_end at ffffffff816902c8
 #4 [ffff881ec7ed77a0] no_context at ffffffff8167ff46
 #5 [ffff881ec7ed77f0] __bad_area_nosemaphore at ffffffff8167ffdc
 #6 [ffff881ec7ed7838] __node_set at ffffffff81680300
 #7 [ffff881ec7ed7860] __do_page_fault at ffffffff8169320f
 #8 [ffff881ec7ed78c0] do_page_fault at ffffffff816932b5
 #9 [ffff881ec7ed78f0] page_fault at ffffffff8168f4c8
    [exception RIP: _raw_spin_lock_irqsave+47]
    RIP: ffffffff8168edef RSP: ffff881ec7ed79a8 RFLAGS: 00010046
    RAX: 0000000000000246 RBX: ffffea0019740d00 RCX: ffff881ec7ed7fd8
    RDX: 0000000000020000 RSI: 0000000000000016 RDI: 0000000000000008
    RBP: ffff881ec7ed79a8 R8: 0000000000000246 R9: 000000000001a098
    R10: ffff88107ffda000 R11: 0000000000000000 R12: 0000000000000000
    R13: 0000000000000008 R14: ffff881ec7ed7a80 R15: ffff881be3afedd0
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018

It happens in the page fault path and results in a double page fault
while compacting pages when memory allocation fails.

Analysis of the vmcore shows that the page that leads to the second
page fault is corrupted, with _mapcount=-256 but private=0.

It's caused by a race between migration and ballooning, due to a
missing lock in virtballoon_migratepage() of the virtio_balloon driver.
This patch fixes the bug.

Fixes: e225042 ("virtio_balloon: introduce migration primitives to balloon pages")
Cc: [email protected]
Signed-off-by: Jiang Biao <[email protected]>
Signed-off-by: Huang Chong <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
mjmartineau pushed a commit that referenced this issue Aug 10, 2018
Petr Machata says:

====================
A test for mirror-to-gretap with team in UL packet path

This patchset adds a test for "tc action mirred mirror" where the
mirrored-to device is a gretap, and the underlay path contains a team
device.

In patch #1 require_command() is added, which should henceforth be used
to declare dependence on a certain tool.

In patch #2, two new functions, team_create() and team_destroy(), are
added to lib.sh.

The newly-added test uses arping, which isn't necessarily available.
Therefore patch #3 introduces $ARPING, and a preexisting test is fixed
to require_command $ARPING.

In patches #4 and #5, two new tests are added. In both cases, a team
device is on egress path of a mirrored packet in a mirror-to-gretap
scenario. In the first one, the team device is in loadbalance mode, in
the second one it's in lacp mode. (The difference in modes necessitates
a different testing strategy, hence two test cases instead of just
parameterizing one.)
====================

Signed-off-by: David S. Miller <[email protected]>
mjmartineau pushed a commit that referenced this issue Aug 10, 2018
syzbot found that the following sequence produces a LOCKDEP splat [1]

ip link add bond10 type bond
ip link add bond11 type bond
ip link set bond11 master bond10

To fix this, we can use the already provided nest_level.

This patch also provides correct nesting for dev->addr_list_lock

[1]
WARNING: possible recursive locking detected
4.18.0-rc6+ #167 Not tainted
--------------------------------------------
syz-executor751/4439 is trying to acquire lock:
(____ptrval____) (&(&bond->stats_lock)->rlock){+.+.}, at: spin_lock include/linux/spinlock.h:310 [inline]
(____ptrval____) (&(&bond->stats_lock)->rlock){+.+.}, at: bond_get_stats+0xb4/0x560 drivers/net/bonding/bond_main.c:3426

but task is already holding lock:
(____ptrval____) (&(&bond->stats_lock)->rlock){+.+.}, at: spin_lock include/linux/spinlock.h:310 [inline]
(____ptrval____) (&(&bond->stats_lock)->rlock){+.+.}, at: bond_get_stats+0xb4/0x560 drivers/net/bonding/bond_main.c:3426

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&(&bond->stats_lock)->rlock);
  lock(&(&bond->stats_lock)->rlock);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

3 locks held by syz-executor751/4439:
 #0: (____ptrval____) (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20 net/core/rtnetlink.c:77
 #1: (____ptrval____) (&(&bond->stats_lock)->rlock){+.+.}, at: spin_lock include/linux/spinlock.h:310 [inline]
 #1: (____ptrval____) (&(&bond->stats_lock)->rlock){+.+.}, at: bond_get_stats+0xb4/0x560 drivers/net/bonding/bond_main.c:3426
 #2: (____ptrval____) (rcu_read_lock){....}, at: bond_get_stats+0x0/0x560 include/linux/compiler.h:215

stack backtrace:
CPU: 0 PID: 4439 Comm: syz-executor751 Not tainted 4.18.0-rc6+ #167
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
 print_deadlock_bug kernel/locking/lockdep.c:1765 [inline]
 check_deadlock kernel/locking/lockdep.c:1809 [inline]
 validate_chain kernel/locking/lockdep.c:2405 [inline]
 __lock_acquire.cold.64+0x1fb/0x486 kernel/locking/lockdep.c:3435
 lock_acquire+0x1e4/0x540 kernel/locking/lockdep.c:3924
 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
 _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:144
 spin_lock include/linux/spinlock.h:310 [inline]
 bond_get_stats+0xb4/0x560 drivers/net/bonding/bond_main.c:3426
 dev_get_stats+0x10f/0x470 net/core/dev.c:8316
 bond_get_stats+0x232/0x560 drivers/net/bonding/bond_main.c:3432
 dev_get_stats+0x10f/0x470 net/core/dev.c:8316
 rtnl_fill_stats+0x4d/0xac0 net/core/rtnetlink.c:1169
 rtnl_fill_ifinfo+0x1aa6/0x3fb0 net/core/rtnetlink.c:1611
 rtmsg_ifinfo_build_skb+0xc8/0x190 net/core/rtnetlink.c:3268
 rtmsg_ifinfo_event.part.30+0x45/0xe0 net/core/rtnetlink.c:3300
 rtmsg_ifinfo_event net/core/rtnetlink.c:3297 [inline]
 rtnetlink_event+0x144/0x170 net/core/rtnetlink.c:4716
 notifier_call_chain+0x180/0x390 kernel/notifier.c:93
 __raw_notifier_call_chain kernel/notifier.c:394 [inline]
 raw_notifier_call_chain+0x2d/0x40 kernel/notifier.c:401
 call_netdevice_notifiers_info+0x3f/0x90 net/core/dev.c:1735
 call_netdevice_notifiers net/core/dev.c:1753 [inline]
 netdev_features_change net/core/dev.c:1321 [inline]
 netdev_change_features+0xb3/0x110 net/core/dev.c:7759
 bond_compute_features.isra.47+0x585/0xa50 drivers/net/bonding/bond_main.c:1120
 bond_enslave+0x1b25/0x5da0 drivers/net/bonding/bond_main.c:1755
 bond_do_ioctl+0x7cb/0xae0 drivers/net/bonding/bond_main.c:3528
 dev_ifsioc+0x43c/0xb30 net/core/dev_ioctl.c:327
 dev_ioctl+0x1b5/0xcc0 net/core/dev_ioctl.c:493
 sock_do_ioctl+0x1d3/0x3e0 net/socket.c:992
 sock_ioctl+0x30d/0x680 net/socket.c:1093
 vfs_ioctl fs/ioctl.c:46 [inline]
 file_ioctl fs/ioctl.c:500 [inline]
 do_vfs_ioctl+0x1de/0x1720 fs/ioctl.c:684
 ksys_ioctl+0xa9/0xd0 fs/ioctl.c:701
 __do_sys_ioctl fs/ioctl.c:708 [inline]
 __se_sys_ioctl fs/ioctl.c:706 [inline]
 __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:706
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x440859
Code: e8 2c af 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 3b 10 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffc51a92878 EFLAGS: 00000213 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000440859
RDX: 0000000020000040 RSI: 0000000000008990 RDI: 0000000000000003
RBP: 0000000000000000 R08: 00000000004002c8 R09: 00000000004002c8
R10: 00000000022d5880 R11: 0000000000000213 R12: 0000000000007390
R13: 0000000000401db0 R14: 0000000000000000 R15: 0000000000000000

Signed-off-by: Eric Dumazet <[email protected]>
Cc: Jay Vosburgh <[email protected]>
Cc: Veaceslav Falico <[email protected]>
Cc: Andy Gospodarek <[email protected]>

Signed-off-by: David S. Miller <[email protected]>
mjmartineau pushed a commit that referenced this issue Aug 10, 2018
Petr Machata says:

====================
ipv4: Control SKB reprioritization after forwarding

After IPv4 packets are forwarded, the priority of the corresponding SKB
is updated according to the TOS field of IPv4 header. This overrides any
prioritization done earlier by e.g. an skbedit action or ingress-qos-map
defined at a vlan device.

Such overriding may not always be desirable. Even if the packet ends up
being routed, which implies this is an L3 network node, an administrator
may wish to preserve whatever prioritization was done earlier on in the
pipeline.

Therefore this patch set introduces a sysctl that controls this
behavior, net.ipv4.ip_forward_update_priority. Its value is 1 by
default to preserve the current behavior.
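
The effect of the sysctl can be modelled in a few lines of user-space C.
Everything here is a hypothetical simplification: struct pkt, forward()
and tos_to_priority() are illustrative names, and the TOS mapping used
(the three IP precedence bits) is a placeholder, not the table the
kernel uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified packet: just the two fields that matter here. */
struct pkt {
	unsigned char tos;	/* TOS field of the IPv4 header */
	unsigned int priority;	/* priority assigned earlier in the pipeline */
};

/* Placeholder mapping (IP precedence bits); the kernel's table differs. */
static unsigned int tos_to_priority(unsigned char tos)
{
	return tos >> 5;
}

/*
 * Model of the forwarding step: update_priority plays the role of
 * net.ipv4.ip_forward_update_priority. With 1 the earlier priority is
 * overridden from TOS; with 0 it is preserved.
 */
static void forward(struct pkt *p, int update_priority)
{
	assert(p != NULL);
	if (update_priority)
		p->priority = tos_to_priority(p->tos);
}
```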

All of the above is implemented in patch #1.

Value changes prompt a new NETEVENT_IPV4_FWD_UPDATE_PRIORITY_UPDATE
notification, so that the drivers can hook up whatever logic may depend
on this value. That is implemented in patch #2.

In patches #3 and #4, mlxsw is adapted to recognize the sysctl. On
initialization, the RGCR register that handles router configuration is
set in accordance with the sysctl. The new notification is listened to
and RGCR is reconfigured as necessary.

In patches #5 to #7, a selftest is added to verify that mlxsw reflects
the sysctl value as necessary. The test is expressed in terms of the
recently-introduced ieee_setapp support, and works by observing how DSCP
value gets rewritten depending on packet priority. For this reason, the
test is added to the subdirectory drivers/net/mlxsw. Even though it's
not particularly specific to mlxsw, it's not suitable for running on
soft devices (which don't support the ieee_setapp et al.).

Changes from v1 to v2:

- In patch #1, init sysctl_ip_fwd_update_priority to 1 instead of true.

Changes from RFC to v1:

- Fix wrong sysctl name in ip-sysctl.txt
- Add notifications
- Add mlxsw support
- Add self test
====================

Signed-off-by: David S. Miller <[email protected]>
mjmartineau pushed a commit that referenced this issue Aug 10, 2018
Amit Pundir and Youling in parallel reported crashes with recent
mainline kernels running Android:

  F DEBUG   : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
  F DEBUG   : Build fingerprint: 'Android/db410c32_only/db410c32_only:Q/OC-MR1/102:userdebug/test-key
  F DEBUG   : Revision: '0'
  F DEBUG   : ABI: 'arm'
  F DEBUG   : pid: 2261, tid: 2261, name: zygote  >>> zygote <<<
  F DEBUG   : signal 7 (SIGBUS), code 2 (BUS_ADRERR), fault addr 0xec00008
  ... <snip> ...
  F DEBUG   : backtrace:
  F DEBUG   :     #00 pc 00001c04  /system/lib/libc.so (memset+48)
  F DEBUG   :     #1 pc 0010c513  /system/lib/libart.so (create_mspace_with_base+82)
  F DEBUG   :     #2 pc 0015c601  /system/lib/libart.so (art::gc::space::DlMallocSpace::CreateMspace(void*, unsigned int, unsigned int)+40)
  F DEBUG   :     #3 pc 0015c3ed  /system/lib/libart.so (art::gc::space::DlMallocSpace::CreateFromMemMap(art::MemMap*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, unsigned int, unsigned int, unsigned int, unsigned int, bool)+36)
  ...

This was bisected back to commit bfd40ea ("mm: fix
vma_is_anonymous() false-positives").

create_mspace_with_base() in the trace above uses ashmem. For shared
ashmem mappings we use shmem_zero_setup(), which sets vma->vm_ops to
&shmem_vm_ops, but for private ashmem mappings nothing sets
vma->vm_ops.

Looking at the problematic patch, it seems to add a requirement that one
call vma_set_anonymous() on a vma, otherwise the dummy_vm_ops will be
used.  Using the dummy_vm_ops seems to trigger SIGBUS when traversing
unmapped pages.

Thus, this patch adds a call to vma_set_anonymous() for ashmem private
mappings and seems to avoid the reported problem.
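
The relationship between the dummy ops table and the anonymous check can be illustrated with a toy model. This is only a sketch under the assumptions stated in the message above: a freshly set-up vma points at a dummy ops table, and marking it anonymous clears vm_ops. All `toy_` names are hypothetical and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures (hypothetical, illustration only). */
struct toy_vm_ops { int unused; };
struct toy_vma { const struct toy_vm_ops *vm_ops; };

static const struct toy_vm_ops toy_dummy_vm_ops = { 0 };

/* Assumed default: a new vma starts with the dummy ops table. */
static void toy_vma_init(struct toy_vma *vma)
{
    assert(vma != NULL);
    vma->vm_ops = &toy_dummy_vm_ops;
}

/* Marking a vma anonymous clears vm_ops in this model. */
static void toy_vma_set_anonymous(struct toy_vma *vma)
{
    vma->vm_ops = NULL;
}

/* A vma counts as anonymous only when it has no ops table at all. */
static bool toy_vma_is_anonymous(const struct toy_vma *vma)
{
    return vma->vm_ops == NULL;
}
```

In this model a private mapping that never calls `toy_vma_set_anonymous()` keeps the dummy ops and is not treated as anonymous, which mirrors the missing call the patch adds.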

Fixes: bfd40ea ("mm: fix vma_is_anonymous() false-positives")
Cc: Kirill Shutemov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Colin Cross <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Reported-by: Amit Pundir <[email protected]>
Reported-by: Youling 257 <[email protected]>
Signed-off-by: John Stultz <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
mjmartineau pushed a commit that referenced this issue Aug 10, 2018
Add the following verifier tests to cover the cgroup storage
functionality:
1) valid access to the cgroup storage
2) invalid access: use regular hashmap instead of cgroup storage map
3) invalid access: use invalid map fd
4) invalid access: try access memory after the cgroup storage
5) invalid access: try access memory before the cgroup storage
6) invalid access: call get_local_storage() with non-zero flags

For tests 2)-6) check returned error strings.

Expected output:
  $ ./test_verifier
  #0/u add+sub+mul OK
  #0/p add+sub+mul OK
  #1/u DIV32 by 0, zero check 1 OK
  ...
  #280/p valid cgroup storage access OK
  #281/p invalid cgroup storage access 1 OK
  #282/p invalid cgroup storage access 2 OK
  #283/p invalid per-cgroup storage access 3 OK
  #284/p invalid cgroup storage access 4 OK
  #285/p invalid cgroup storage access 5 OK
  ...
  #649/p pass modified ctx pointer to helper, 2 OK
  #650/p pass modified ctx pointer to helper, 3 OK
  Summary: 901 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Roman Gushchin <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Acked-by: Martin KaFai Lau <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
mjmartineau pushed a commit that referenced this issue Aug 10, 2018
Guillaume Nault says:

====================
l2tp: sanitise MTU handling on sessions

Most of the code handling sessions' MTU has no effect. The ->mtu field
in struct l2tp_session might be used at session creation time, but
neither PPP nor Ethernet pseudo-wires take updates into account.

L2TP sessions don't have a concept of MTU, which is the reason why
->mtu is mostly ignored. MTU should remain a network device thing.
Therefore this patch set does not try to propagate/update ->mtu to/from
the device. That would complicate the code unnecessarily. Instead this
field and the associated ioctl commands and netlink attributes are
removed.

Patch #1 defines l2tp_tunnel_dst_mtu() in order to simplify the
following patches. Then patches #2 and #3 remove MTU handling from PPP
and Ethernet pseudo-wires respectively.
====================

Signed-off-by: David S. Miller <[email protected]>
mjmartineau pushed a commit that referenced this issue Aug 10, 2018
Ido Schimmel says:

====================
mlxsw: Enable MC-aware mode for mlxsw ports

Petr says:

Due to an issue in Spectrum chips, when unicast traffic shares the same
queue as BUM traffic, and there is a congestion, the BUM traffic is
admitted to the queue anyway, thus pushing out all UC traffic. In order
to give unicast traffic precedence over BUM traffic, configure
multicast-aware mode on all ports.

Under multicast-aware regime, when assigning traffic class to a packet,
the switch doesn't merely take the value prescribed by the QTCT
register. For BUM traffic, it instead assigns that value plus 8. That
limits the number of available TCs, but since mlxsw currently only uses
the lower eight anyway, it is no real loss.
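
As an illustration of the class assignment described above, here is a minimal sketch. The offset of 8 comes from the text; `toy_assign_tc` and `TOY_MC_TC_OFFSET` are hypothetical names and do not correspond to mlxsw code or to how the hardware is programmed.

```c
#include <assert.h>
#include <stdbool.h>

/* Offset added for BUM traffic in multicast-aware mode (value from the text). */
#define TOY_MC_TC_OFFSET 8u

/*
 * qtct_tc is the traffic class prescribed by QTCT; only the lower eight
 * classes (0..7) are used, so the shifted class stays within 8..15.
 */
static unsigned int toy_assign_tc(unsigned int qtct_tc, bool is_bum, bool mc_aware)
{
    assert(qtct_tc < TOY_MC_TC_OFFSET);
    if (mc_aware && is_bum)
        return qtct_tc + TOY_MC_TC_OFFSET;
    return qtct_tc;
}
```

With multicast-aware mode on, a BUM packet mapped to class 3 lands in class 11, while unicast stays in class 3.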

The two TCs (UC and MC one) are then mapped to the same subgroup and
strictly prioritized so that UC traffic is preferred in case of
congestion.

In patch #1, introduce a new register, QTCTM, which enables the
multicast-aware mode.

In patch #2, fix a typo in related code.

In patch #3, set up TCs and QTCTM to enable multicast-aware mode.
====================

Signed-off-by: David S. Miller <[email protected]>
mjmartineau pushed a commit that referenced this issue Aug 10, 2018
The shift of 'cwnd' with '(now - hc->tx_lsndtime) / hc->tx_rto' value
can lead to undefined behavior [1].

In order to fix this use a gradual shift of the window with a 'while'
loop, similar to what tcp_cwnd_restart() is doing.

When comparing delta and RTO there is a minor difference between TCP
and DCCP: the latter also invokes dccp_cwnd_restart() and reduces
'cwnd' if delta equals RTO. That case is preserved in this change.
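
A minimal sketch of the gradual shift, assuming a non-negative signed delta and a positive RTO. `toy_cwnd_restart` and its parameters are hypothetical names for illustration; this is not the kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Halve cwnd once per elapsed RTO instead of computing
 * cwnd >> (delta / rto), whose shift count can exceed the type width.
 * The window is never reduced below min(cwnd, iwnd).
 */
static uint32_t toy_cwnd_restart(uint32_t cwnd, uint32_t iwnd,
                                 int32_t delta, int32_t rto)
{
    uint32_t restart_cwnd = cwnd < iwnd ? cwnd : iwnd;

    assert(rto > 0 && delta >= 0);

    /* ">= 0" keeps the case where delta equals RTO: one halving. */
    while ((delta -= rto) >= 0 && cwnd > restart_cwnd)
        cwnd >>= 1;

    return cwnd > restart_cwnd ? cwnd : restart_cwnd;
}
```

Even with an idle period of 67 RTOs, as in the report below, the loop stops as soon as the window reaches the floor, so no oversized shift is ever evaluated.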

[1]:
[40850.963623] UBSAN: Undefined behaviour in net/dccp/ccids/ccid2.c:237:7
[40851.043858] shift exponent 67 is too large for 32-bit type 'unsigned int'
[40851.127163] CPU: 3 PID: 15940 Comm: netstress Tainted: G        W   E     4.18.0-rc7.x86_64 #1
...
[40851.377176] Call Trace:
[40851.408503]  dump_stack+0xf1/0x17b
[40851.451331]  ? show_regs_print_info+0x5/0x5
[40851.503555]  ubsan_epilogue+0x9/0x7c
[40851.548363]  __ubsan_handle_shift_out_of_bounds+0x25b/0x2b4
[40851.617109]  ? __ubsan_handle_load_invalid_value+0x18f/0x18f
[40851.686796]  ? xfrm4_output_finish+0x80/0x80
[40851.739827]  ? lock_downgrade+0x6d0/0x6d0
[40851.789744]  ? xfrm4_prepare_output+0x160/0x160
[40851.845912]  ? ip_queue_xmit+0x810/0x1db0
[40851.895845]  ? ccid2_hc_tx_packet_sent+0xd36/0x10a0 [dccp]
[40851.963530]  ccid2_hc_tx_packet_sent+0xd36/0x10a0 [dccp]
[40852.029063]  dccp_xmit_packet+0x1d3/0x720 [dccp]
[40852.086254]  dccp_write_xmit+0x116/0x1d0 [dccp]
[40852.142412]  dccp_sendmsg+0x428/0xb20 [dccp]
[40852.195454]  ? inet_dccp_listen+0x200/0x200 [dccp]
[40852.254833]  ? sched_clock+0x5/0x10
[40852.298508]  ? sched_clock+0x5/0x10
[40852.342194]  ? inet_create+0xdf0/0xdf0
[40852.388988]  sock_sendmsg+0xd9/0x160
...

Fixes: 113ced1 ("dccp ccid-2: Perform congestion-window validation")
Signed-off-by: Alexey Kodanev <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
mjmartineau pushed a commit that referenced this issue Aug 10, 2018
The definition of static_key_slow_inc() has cpus_read_lock in place. In the
virtio_net driver, XPS queues are initialized after setting the queue:cpu
affinity in virtnet_set_affinity() which is already protected within
cpus_read_lock. Lockdep prints a warning when we are trying to acquire
cpus_read_lock when it is already held.

This patch adds the ability to call __netif_set_xps_queue under
cpus_read_lock().
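
The calling convention can be sketched with a toy nesting counter in place of the real lock. All `toy_` names are hypothetical; the sketch only shows the split between a wrapper that takes the lock and a variant for callers that already hold it.

```c
#include <assert.h>

/* Toy stand-in for the hotplug read lock: counts nesting depth. */
static int toy_hotplug_depth;

static void toy_cpus_read_lock(void)
{
    toy_hotplug_depth++;
}

static void toy_cpus_read_unlock(void)
{
    assert(toy_hotplug_depth > 0);
    toy_hotplug_depth--;
}

/* Variant for callers that already hold the lock; takes no lock itself. */
static int toy_set_xps_queue_locked(int queue)
{
    (void)queue;
    assert(toy_hotplug_depth > 0);
    return toy_hotplug_depth;  /* observed nesting depth */
}

/* Wrapper for callers that do not hold the lock yet. */
static int toy_set_xps_queue(int queue)
{
    int depth;

    toy_cpus_read_lock();
    depth = toy_set_xps_queue_locked(queue);
    toy_cpus_read_unlock();
    return depth;
}
```

A caller that is already inside the locked region uses the `_locked` variant, so the nesting depth stays at one on both paths.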
Acked-by: Jason Wang <[email protected]>

============================================
WARNING: possible recursive locking detected
4.18.0-rc3-next-20180703+ #1 Not tainted
--------------------------------------------
swapper/0/1 is trying to acquire lock:
00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: static_key_slow_inc+0xe/0x20

but task is already holding lock:
00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: init_vqs+0x513/0x5a0

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(cpu_hotplug_lock.rw_sem);
  lock(cpu_hotplug_lock.rw_sem);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

3 locks held by swapper/0/1:
 #0: 00000000244bc7da (&dev->mutex){....}, at: __driver_attach+0x5a/0x110
 #1: 00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: init_vqs+0x513/0x5a0
 #2: 000000005cd8463f (xps_map_mutex){+.+.}, at: __netif_set_xps_queue+0x8d/0xc60

v2: move cpus_read_lock() out of __netif_set_xps_queue()

Cc: "Nambiar, Amritha" <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Jason Wang <[email protected]>
Fixes: 8af2c06 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")

Signed-off-by: Andrei Vagin <[email protected]>

Signed-off-by: David S. Miller <[email protected]>
mjmartineau pushed a commit that referenced this issue Sep 12, 2018
…_read

Subflows can get removed from under our feet, thus we might be iterating
on garbage here.

That can panic like:

[52899.160112] BUG: unable to handle kernel NULL pointer dereference at           (null)
[52899.160157] IP: tcp_splice_read+0x225/0x330
[52899.160166] PGD 8000000164ff8067 P4D 8000000164ff8067 PUD 163d67067 PMD 0
[52899.160189] Oops: 0000 [#1] SMP PTI
[52899.160198] Modules linked in: binfmt_misc xt_REDIRECT nf_nat_redirect xt_statistic xt_mark ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_connmark xt_nat xt_comment xt_geoip(O) xt_conntrack iptable_mangle iptable_nat nf_nat_ipv4 nf_nat iptable_filter sch_fq_codel nf_conntrack_tftp nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_proto_gre nf_conntrack_irc nf_conntrack_ftp nf_conntrack pcspkr tun it87 hwmon_vid vfat fat x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd glue_helper cryptd iTCO_wdt i2c_designware_platform iTCO_vendor_support i2c_designware_core intel_cstate intel_rapl_perf idma64 i2c_i801 sg virt_dma pinctrl_sunrisepoint wmi pinctrl_intel acpi_pad intel_lpss_pci intel_lpss mei_me pcc_cpufreq
[52899.160410]  mfd_core intel_pch_thermal mei shpchp ip_tables xfs libcrc32c sd_mod crc32c_intel igb ptp sdhci_pci pps_core sdhci dca i915 ahci i2c_algo_bit mmc_core libahci drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops libata drm video [last unloaded: pcspkr]
[52899.160495] CPU: 0 PID: 21938 Comm: redsocks Tainted: G           O    4.14.64+ #1
[52899.160506] Hardware name: Default string Default string/Default string, BIOS 5.12 07/01/2018
[52899.160518] task: ffff880164d45e00 task.stack: ffffc90002824000
[52899.160536] RIP: 0010:tcp_splice_read+0x225/0x330
[52899.160546] RSP: 0018:ffffc90002827dd8 EFLAGS: 00010286
[52899.160558] RAX: 0000000000000000 RBX: ffff88015b965280 RCX: 0000000000100000
[52899.160569] RDX: ffff88015ba30180 RSI: ffffc90002827ee8 RDI: ffff8801164d82c0
[52899.160579] RBP: ffffc90002827e50 R08: 00000000ffffffff R09: 0000000000000000
[52899.160589] R10: ffff880163940f00 R11: ffff88015ba30180 R12: ffff8801164d82c0
[52899.160599] R13: ffff88015ba30180 R14: ffffc90002827ee8 R15: 0000000000000003
[52899.160611] FS:  00007fae07922740(0000) GS:ffff88016ec00000(0000) knlGS:0000000000000000
[52899.160624] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[52899.160634] CR2: 0000000000000000 CR3: 0000000164c9a006 CR4: 00000000003606f0
[52899.160646] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[52899.160656] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[52899.160664] Call Trace:
[52899.160684]  ? kmem_cache_free+0x1aa/0x1c0
[52899.160702]  sock_splice_read+0x25/0x30
[52899.160719]  do_splice_to+0x76/0x90
[52899.160735]  SyS_splice+0x6fd/0x750
[52899.160750]  ? syscall_trace_enter+0x1cd/0x2b0
[52899.160766]  do_syscall_64+0x79/0x1b0
[52899.160784]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[52899.160795] RIP: 0033:0x7fae07213493
[52899.160804] RSP: 002b:00007fff48ffdfe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000113
[52899.160819] RAX: ffffffffffffffda RBX: 00007fff48ffe040 RCX: 00007fae07213493
[52899.160829] RDX: 000000000000008b RSI: 0000000000000000 RDI: 0000000000000081
[52899.160839] RBP: 0000000000000081 R08: 0000000000100000 R09: 0000000000000003
[52899.160849] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000002214d40
[52899.160859] R13: 0000000000000081 R14: 0000000000000002 R15: 00000000022124f0
[52899.160870] Code: ff ff 48 8b 83 d0 07 00 00 48 8b 00 48 85 c0 0f 84 33 fe ff ff 44 8b 05 7a eb d3 00 41 f7 d0 0f 1f 44 00 00 48 8b 80 e8 07 00 00 <48> 8b 00 48 85 c0 75 ec e9 10 fe ff ff 0f b6 43 12 3c 01 0f 85
[52899.161029] RIP: tcp_splice_read+0x225/0x330 RSP: ffffc90002827dd8
[52899.161038] CR2: 0000000000000000

Github-issue: multipath-tcp/mptcp#279

Fixes: ee4f8f6 ("Support tcp_read_sock")
Reported-by: https://github.com/wapsi
Signed-off-by: Christoph Paasch <[email protected]>
Signed-off-by: Matthieu Baerts <[email protected]>
(cherry picked from commit 80671d2)
Signed-off-by: Christoph Paasch <[email protected]>
mjmartineau pushed a commit that referenced this issue Sep 28, 2018
A kernel crash occurs when a defragmented packet is fragmented
in ip_do_fragment().
In the defragment routine, skb_orphan() is called and
skb->ip_defrag_offset is set, but skb->sk and
skb->ip_defrag_offset are members of the same union, so
frag->sk is not NULL.
Hence the crash occurs in the skb->sk check in ip_do_fragment() when
the defragmented packet is fragmented.
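
The aliasing can be shown with a toy structure. This is a sketch only: `toy_skb` and the helpers are hypothetical, and it assumes a common ABI where storing a non-zero int into the union leaves the overlapping pointer non-NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_sock { int id; };

/* Toy model: the owner pointer and the defrag offset share storage. */
struct toy_skb {
    union {
        struct toy_sock *sk;
        int ip_defrag_offset;
    };
};

/* Drop the owner, as orphaning does in this model. */
static void toy_skb_orphan(struct toy_skb *skb)
{
    skb->sk = NULL;
}

/* Queueing a fragment records its offset in the shared storage. */
static void toy_defrag_queue(struct toy_skb *skb, int offset)
{
    skb->ip_defrag_offset = offset;
}

/* The v2 approach from the message: clear sk once reassembly is done. */
static void toy_reasm_finish(struct toy_skb *skb)
{
    skb->sk = NULL;
}

static bool toy_sk_is_set(const struct toy_skb *skb)
{
    assert(skb != NULL);
    return skb->sk != NULL;
}
```

After `toy_defrag_queue()` stores a non-zero offset, `toy_sk_is_set()` reports an owner even though the skb was orphaned; clearing the field at the end of reassembly restores the expected NULL.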

test commands:
   %iptables -t nat -I POSTROUTING -j MASQUERADE
   %hping3 192.168.4.2 -s 1000 -p 2000 -d 60000

splat looks like:
[  261.069429] kernel BUG at net/ipv4/ip_output.c:636!
[  261.075753] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[  261.083854] CPU: 1 PID: 1349 Comm: hping3 Not tainted 4.19.0-rc2+ #3
[  261.100977] RIP: 0010:ip_do_fragment+0x1613/0x2600
[  261.106945] Code: e8 e2 38 e3 fe 4c 8b 44 24 18 48 8b 74 24 08 e9 92 f6 ff ff 80 3c 02 00 0f 85 da 07 00 00 48 8b b5 d0 00 00 00 e9 25 f6 ff ff <0f> 0b 0f 0b 44 8b 54 24 58 4c 8b 4c 24 18 4c 8b 5c 24 60 4c 8b 6c
[  261.127015] RSP: 0018:ffff8801031cf2c0 EFLAGS: 00010202
[  261.134156] RAX: 1ffff1002297537b RBX: ffffed0020639e6e RCX: 0000000000000004
[  261.142156] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880114ba9bd8
[  261.150157] RBP: ffff880114ba8a40 R08: ffffed0022975395 R09: ffffed0022975395
[  261.158157] R10: 0000000000000001 R11: ffffed0022975394 R12: ffff880114ba9ca4
[  261.166159] R13: 0000000000000010 R14: ffff880114ba9bc0 R15: dffffc0000000000
[  261.174169] FS:  00007fbae2199700(0000) GS:ffff88011b400000(0000) knlGS:0000000000000000
[  261.183012] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  261.189013] CR2: 00005579244fe000 CR3: 0000000119bf4000 CR4: 00000000001006e0
[  261.198158] Call Trace:
[  261.199018]  ? dst_output+0x180/0x180
[  261.205011]  ? save_trace+0x300/0x300
[  261.209018]  ? ip_copy_metadata+0xb00/0xb00
[  261.213034]  ? sched_clock_local+0xd4/0x140
[  261.218158]  ? kill_l4proto+0x120/0x120 [nf_conntrack]
[  261.223014]  ? rt_cpu_seq_stop+0x10/0x10
[  261.227014]  ? find_held_lock+0x39/0x1c0
[  261.233008]  ip_finish_output+0x51d/0xb50
[  261.237006]  ? ip_fragment.constprop.56+0x220/0x220
[  261.243011]  ? nf_ct_l4proto_register_one+0x5b0/0x5b0 [nf_conntrack]
[  261.250152]  ? rcu_is_watching+0x77/0x120
[  261.255010]  ? nf_nat_ipv4_out+0x1e/0x2b0 [nf_nat_ipv4]
[  261.261033]  ? nf_hook_slow+0xb1/0x160
[  261.265007]  ip_output+0x1c7/0x710
[  261.269005]  ? ip_mc_output+0x13f0/0x13f0
[  261.273002]  ? __local_bh_enable_ip+0xe9/0x1b0
[  261.278152]  ? ip_fragment.constprop.56+0x220/0x220
[  261.282996]  ? nf_hook_slow+0xb1/0x160
[  261.287007]  raw_sendmsg+0x21f9/0x4420
[  261.291008]  ? dst_output+0x180/0x180
[  261.297003]  ? sched_clock_cpu+0x126/0x170
[  261.301003]  ? find_held_lock+0x39/0x1c0
[  261.306155]  ? stop_critical_timings+0x420/0x420
[  261.311004]  ? check_flags.part.36+0x450/0x450
[  261.315005]  ? _raw_spin_unlock_irq+0x29/0x40
[  261.320995]  ? _raw_spin_unlock_irq+0x29/0x40
[  261.326142]  ? cyc2ns_read_end+0x10/0x10
[  261.330139]  ? raw_bind+0x280/0x280
[  261.334138]  ? sched_clock_cpu+0x126/0x170
[  261.338995]  ? check_flags.part.36+0x450/0x450
[  261.342991]  ? __lock_acquire+0x4500/0x4500
[  261.348994]  ? inet_sendmsg+0x11c/0x500
[  261.352989]  ? dst_output+0x180/0x180
[  261.357012]  inet_sendmsg+0x11c/0x500
[ ... ]

v2:
 - clear skb->sk in the reassembly routine. (Eric Dumazet)

Fixes: fa0f527 ("ip: use rb trees for IP frag queue.")
Suggested-by: Eric Dumazet <[email protected]>
Signed-off-by: Taehee Yoo <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
mjmartineau pushed a commit that referenced this issue Sep 28, 2018
The following lockdep report can be triggered by writing to /sys/kernel/debug/sched_features:

  ======================================================
  WARNING: possible circular locking dependency detected
  4.18.0-rc6-00152-gcd3f77d74ac3-dirty #18 Not tainted
  ------------------------------------------------------
  sh/3358 is trying to acquire lock:
  000000004ad3989d (cpu_hotplug_lock.rw_sem){++++}, at: static_key_enable+0x14/0x30
  but task is already holding lock:
  00000000c1b31a88 (&sb->s_type->i_mutex_key#3){+.+.}, at: sched_feat_write+0x160/0x428
  which lock already depends on the new lock.
  the existing dependency chain (in reverse order) is:
  -> #3 (&sb->s_type->i_mutex_key#3){+.+.}:
         lock_acquire+0xb8/0x148
         down_write+0xac/0x140
         start_creating+0x5c/0x168
         debugfs_create_dir+0x18/0x220
         opp_debug_register+0x8c/0x120
         _add_opp_dev+0x104/0x1f8
         dev_pm_opp_get_opp_table+0x174/0x340
         _of_add_opp_table_v2+0x110/0x760
         dev_pm_opp_of_add_table+0x5c/0x240
         dev_pm_opp_of_cpumask_add_table+0x5c/0x100
         cpufreq_init+0x160/0x430
         cpufreq_online+0x1cc/0xe30
         cpufreq_add_dev+0x78/0x198
         subsys_interface_register+0x168/0x270
         cpufreq_register_driver+0x1c8/0x278
         dt_cpufreq_probe+0xdc/0x1b8
         platform_drv_probe+0xb4/0x168
         driver_probe_device+0x318/0x4b0
         __device_attach_driver+0xfc/0x1f0
         bus_for_each_drv+0xf8/0x180
         __device_attach+0x164/0x200
         device_initial_probe+0x10/0x18
         bus_probe_device+0x110/0x178
         device_add+0x6d8/0x908
         platform_device_add+0x138/0x3d8
         platform_device_register_full+0x1cc/0x1f8
         cpufreq_dt_platdev_init+0x174/0x1bc
         do_one_initcall+0xb8/0x310
         kernel_init_freeable+0x4b8/0x56c
         kernel_init+0x10/0x138
         ret_from_fork+0x10/0x18
  -> #2 (opp_table_lock){+.+.}:
         lock_acquire+0xb8/0x148
         __mutex_lock+0x104/0xf50
         mutex_lock_nested+0x1c/0x28
         _of_add_opp_table_v2+0xb4/0x760
         dev_pm_opp_of_add_table+0x5c/0x240
         dev_pm_opp_of_cpumask_add_table+0x5c/0x100
         cpufreq_init+0x160/0x430
         cpufreq_online+0x1cc/0xe30
         cpufreq_add_dev+0x78/0x198
         subsys_interface_register+0x168/0x270
         cpufreq_register_driver+0x1c8/0x278
         dt_cpufreq_probe+0xdc/0x1b8
         platform_drv_probe+0xb4/0x168
         driver_probe_device+0x318/0x4b0
         __device_attach_driver+0xfc/0x1f0
         bus_for_each_drv+0xf8/0x180
         __device_attach+0x164/0x200
         device_initial_probe+0x10/0x18
         bus_probe_device+0x110/0x178
         device_add+0x6d8/0x908
         platform_device_add+0x138/0x3d8
         platform_device_register_full+0x1cc/0x1f8
         cpufreq_dt_platdev_init+0x174/0x1bc
         do_one_initcall+0xb8/0x310
         kernel_init_freeable+0x4b8/0x56c
         kernel_init+0x10/0x138
         ret_from_fork+0x10/0x18
  -> #1 (subsys mutex#6){+.+.}:
         lock_acquire+0xb8/0x148
         __mutex_lock+0x104/0xf50
         mutex_lock_nested+0x1c/0x28
         subsys_interface_register+0xd8/0x270
         cpufreq_register_driver+0x1c8/0x278
         dt_cpufreq_probe+0xdc/0x1b8
         platform_drv_probe+0xb4/0x168
         driver_probe_device+0x318/0x4b0
         __device_attach_driver+0xfc/0x1f0
         bus_for_each_drv+0xf8/0x180
         __device_attach+0x164/0x200
         device_initial_probe+0x10/0x18
         bus_probe_device+0x110/0x178
         device_add+0x6d8/0x908
         platform_device_add+0x138/0x3d8
         platform_device_register_full+0x1cc/0x1f8
         cpufreq_dt_platdev_init+0x174/0x1bc
         do_one_initcall+0xb8/0x310
         kernel_init_freeable+0x4b8/0x56c
         kernel_init+0x10/0x138
         ret_from_fork+0x10/0x18
  -> #0 (cpu_hotplug_lock.rw_sem){++++}:
         __lock_acquire+0x203c/0x21d0
         lock_acquire+0xb8/0x148
         cpus_read_lock+0x58/0x1c8
         static_key_enable+0x14/0x30
         sched_feat_write+0x314/0x428
         full_proxy_write+0xa0/0x138
         __vfs_write+0xd8/0x388
         vfs_write+0xdc/0x318
         ksys_write+0xb4/0x138
         sys_write+0xc/0x18
         __sys_trace_return+0x0/0x4
  other info that might help us debug this:
  Chain exists of:
    cpu_hotplug_lock.rw_sem --> opp_table_lock --> &sb->s_type->i_mutex_key#3
   Possible unsafe locking scenario:
         CPU0                    CPU1
         ----                    ----
    lock(&sb->s_type->i_mutex_key#3);
                                 lock(opp_table_lock);
                                 lock(&sb->s_type->i_mutex_key#3);
    lock(cpu_hotplug_lock.rw_sem);
   *** DEADLOCK ***
  2 locks held by sh/3358:
   #0: 00000000a8c4b363 (sb_writers#10){.+.+}, at: vfs_write+0x238/0x318
   #1: 00000000c1b31a88 (&sb->s_type->i_mutex_key#3){+.+.}, at: sched_feat_write+0x160/0x428
  stack backtrace:
  CPU: 5 PID: 3358 Comm: sh Not tainted 4.18.0-rc6-00152-gcd3f77d74ac3-dirty #18
  Hardware name: Renesas H3ULCB Kingfisher board based on r8a7795 ES2.0+ (DT)
  Call trace:
   dump_backtrace+0x0/0x288
   show_stack+0x14/0x20
   dump_stack+0x13c/0x1ac
   print_circular_bug.isra.10+0x270/0x438
   check_prev_add.constprop.16+0x4dc/0xb98
   __lock_acquire+0x203c/0x21d0
   lock_acquire+0xb8/0x148
   cpus_read_lock+0x58/0x1c8
   static_key_enable+0x14/0x30
   sched_feat_write+0x314/0x428
   full_proxy_write+0xa0/0x138
   __vfs_write+0xd8/0x388
   vfs_write+0xdc/0x318
   ksys_write+0xb4/0x138
   sys_write+0xc/0x18
   __sys_trace_return+0x0/0x4

This is because when loading the cpufreq_dt module we first acquire
cpu_hotplug_lock.rw_sem lock, then in cpufreq_init(), we are taking
the &sb->s_type->i_mutex_key lock.

But when writing to /sys/kernel/debug/sched_features, the
cpu_hotplug_lock.rw_sem lock depends on the &sb->s_type->i_mutex_key lock.

To fix this bug, reverse the lock acquisition order when writing to
sched_features. This way cpu_hotplug_lock.rw_sem no longer depends on
&sb->s_type->i_mutex_key.
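
The dependency chain in the report can be replayed with a toy order tracker. This is a rough sketch of the idea behind the check, not lockdep's implementation; every `toy_`/`TOY_` name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum { TOY_HOTPLUG, TOY_SUBSYS, TOY_OPP_TABLE, TOY_INODE, TOY_NLOCKS };

/* toy_dep[a][b] is true once lock b has been taken while a was held. */
static bool toy_dep[TOY_NLOCKS][TOY_NLOCKS];

/* Depth-first search: is 'to' reachable from 'from' via recorded edges? */
static bool toy_reachable(int from, int to)
{
    bool seen[TOY_NLOCKS] = { false };
    int stack[TOY_NLOCKS];
    int top = 0;

    stack[top++] = from;
    seen[from] = true;
    while (top > 0) {
        int cur = stack[--top];

        if (cur == to)
            return true;
        for (int next = 0; next < TOY_NLOCKS; next++) {
            if (toy_dep[cur][next] && !seen[next]) {
                seen[next] = true;
                stack[top++] = next;
            }
        }
    }
    return false;
}

/*
 * Record "next taken while held is held". Returns false, recording
 * nothing, when the new edge would close a cycle.
 */
static bool toy_acquire_after(int held, int next)
{
    assert(held >= 0 && held < TOY_NLOCKS);
    assert(next >= 0 && next < TOY_NLOCKS);
    if (toy_reachable(next, held))
        return false;
    toy_dep[held][next] = true;
    return true;
}
```

Recording hotplug -> subsys -> opp_table -> inode and then requesting hotplug while holding inode is rejected, whereas taking hotplug first and inode second adds no cycle.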

Tested-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Jiada Wang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Eugeniu Rosca <[email protected]>
Cc: George G. Davis <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
mjmartineau pushed a commit that referenced this issue Sep 28, 2018
If local OOB data was generated and the other device initiated pairing
claiming that it had OOB data, the following crash occurred:

[  222.847853] general protection fault: 0000 [#1] SMP PTI
[  222.848025] CPU: 1 PID: 42 Comm: kworker/u5:0 Tainted: G         C        4.18.0-custom #4
[  222.848158] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  222.848307] Workqueue: hci0 hci_rx_work [bluetooth]
[  222.848416] RIP: 0010:compute_ecdh_secret+0x5a/0x270 [bluetooth]
[  222.848540] Code: 0c af f5 48 8b 3d 46 de f0 f6 ba 40 00 00 00 be c0 00 60 00 e8 b7 7b c5 f5 48 85 c0 0f 84 ea 01 00 00 48 89 c3 e8 16 0c af f5 <49> 8b 47 38 be c0 00 60 00 8b 78 f8 48 83 c7 48 e8 51 84 c5 f5 48
[  222.848914] RSP: 0018:ffffb1664087fbc0 EFLAGS: 00010293
[  222.849021] RAX: ffff8a5750d7dc00 RBX: ffff8a5671096780 RCX: ffffffffc08bc32a
[  222.849111] RDX: 0000000000000000 RSI: 00000000006000c0 RDI: ffff8a5752003800
[  222.849192] RBP: ffffb1664087fc60 R08: ffff8a57525280a0 R09: ffff8a5752003800
[  222.849269] R10: ffffb1664087fc70 R11: 0000000000000093 R12: ffff8a5674396e00
[  222.849350] R13: ffff8a574c2e79aa R14: ffff8a574c2e796a R15: 020e0e100d010101
[  222.849429] FS:  0000000000000000(0000) GS:ffff8a5752500000(0000) knlGS:0000000000000000
[  222.849518] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  222.849586] CR2: 000055856016a038 CR3: 0000000110d2c005 CR4: 00000000000606e0
[  222.849671] Call Trace:
[  222.849745]  ? sc_send_public_key+0x110/0x2a0 [bluetooth]
[  222.849825]  ? sc_send_public_key+0x115/0x2a0 [bluetooth]
[  222.849925]  smp_recv_cb+0x959/0x2490 [bluetooth]
[  222.850023]  ? _cond_resched+0x19/0x40
[  222.850105]  ? mutex_lock+0x12/0x40
[  222.850202]  l2cap_recv_frame+0x109d/0x3420 [bluetooth]
[  222.850315]  ? l2cap_recv_frame+0x109d/0x3420 [bluetooth]
[  222.850426]  ? __switch_to_asm+0x34/0x70
[  222.850515]  ? __switch_to_asm+0x40/0x70
[  222.850625]  ? __switch_to_asm+0x34/0x70
[  222.850724]  ? __switch_to_asm+0x40/0x70
[  222.850786]  ? __switch_to_asm+0x34/0x70
[  222.850846]  ? __switch_to_asm+0x40/0x70
[  222.852581]  ? __switch_to_asm+0x34/0x70
[  222.854976]  ? __switch_to_asm+0x40/0x70
[  222.857475]  ? __switch_to_asm+0x40/0x70
[  222.859775]  ? __switch_to_asm+0x34/0x70
[  222.861218]  ? __switch_to_asm+0x40/0x70
[  222.862327]  ? __switch_to_asm+0x34/0x70
[  222.863758]  l2cap_recv_acldata+0x266/0x3c0 [bluetooth]
[  222.865122]  hci_rx_work+0x1c9/0x430 [bluetooth]
[  222.867144]  process_one_work+0x210/0x4c0
[  222.868248]  worker_thread+0x41/0x4d0
[  222.869420]  kthread+0x141/0x160
[  222.870694]  ? process_one_work+0x4c0/0x4c0
[  222.871668]  ? kthread_create_worker_on_cpu+0x90/0x90
[  222.872896]  ret_from_fork+0x35/0x40
[  222.874132] Modules linked in: algif_hash algif_skcipher af_alg rfcomm bnep btusb btrtl btbcm btintel snd_intel8x0 cmac intel_rapl_perf vboxvideo(C) snd_ac97_codec bluetooth ac97_bus joydev ttm snd_pcm ecdh_generic drm_kms_helper snd_timer snd input_leds drm serio_raw fb_sys_fops soundcore syscopyarea sysfillrect sysimgblt mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper ahci psmouse libahci i2c_piix4 video e1000 pata_acpi
[  222.883153] fbcon_switch: detected unhandled fb_set_par error, error code -16
[  222.886774] fbcon_switch: detected unhandled fb_set_par error, error code -16
[  222.890503] ---[ end trace 6504aa7a777b5316 ]---
[  222.890541] RIP: 0010:compute_ecdh_secret+0x5a/0x270 [bluetooth]
[  222.890551] Code: 0c af f5 48 8b 3d 46 de f0 f6 ba 40 00 00 00 be c0 00 60 00 e8 b7 7b c5 f5 48 85 c0 0f 84 ea 01 00 00 48 89 c3 e8 16 0c af f5 <49> 8b 47 38 be c0 00 60 00 8b 78 f8 48 83 c7 48 e8 51 84 c5 f5 48
[  222.890555] RSP: 0018:ffffb1664087fbc0 EFLAGS: 00010293
[  222.890561] RAX: ffff8a5750d7dc00 RBX: ffff8a5671096780 RCX: ffffffffc08bc32a
[  222.890565] RDX: 0000000000000000 RSI: 00000000006000c0 RDI: ffff8a5752003800
[  222.890571] RBP: ffffb1664087fc60 R08: ffff8a57525280a0 R09: ffff8a5752003800
[  222.890576] R10: ffffb1664087fc70 R11: 0000000000000093 R12: ffff8a5674396e00
[  222.890581] R13: ffff8a574c2e79aa R14: ffff8a574c2e796a R15: 020e0e100d010101
[  222.890586] FS:  0000000000000000(0000) GS:ffff8a5752500000(0000) knlGS:0000000000000000
[  222.890591] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  222.890594] CR2: 000055856016a038 CR3: 0000000110d2c005 CR4: 00000000000606e0

This commit fixes a bug where an invalid pointer to the crypto tfm was
used for the SMP SC ECDH calculation when OOB was in use. The solution is
to use the same crypto tfm as when generating OOB material in the
generate_oob() function.

This bug was introduced in commit c0153b0 ("Bluetooth: let the crypto
subsystem generate the ecc privkey"). The bug was found by fuzzing the
kernel SMP implementation using Synopsys Defensics.

Signed-off-by: Matias Karhumaa <[email protected]>
Signed-off-by: Johan Hedberg <[email protected]>
Signed-off-by: Marcel Holtmann <[email protected]>
mjmartineau pushed a commit that referenced this issue Sep 28, 2018
Yonghong Song says:

====================
The support to dump program array and map_in_map maps
for bpffs and bpftool is added. Patch #1 added bpffs support
and Patch #2 added bpftool support. Please see
individual patches for example output.
====================

Signed-off-by: Alexei Starovoitov <[email protected]>
mjmartineau pushed a commit that referenced this issue Sep 28, 2018
RaceFuzzer produced the report below because no lock closes the race
between binder_mmap and binder_alloc_new_buf_locked.
To close the race, let's use a memory barrier so that anyone who sees a
non-NULL alloc->vma never sees a NULL alloc->vma_vm_mm.
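
The ordering requirement can be sketched in userspace with C11 release/acquire atomics standing in for kernel memory barriers. This is an analogy under that assumption, not the driver code; all `toy_` names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct toy_mm { int users; };
struct toy_vma { struct toy_mm *vm_mm; };

struct toy_alloc {
    struct toy_mm *vma_vm_mm;
    _Atomic(struct toy_vma *) vma;  /* published last */
};

/* Writer: fill in vma_vm_mm first, then publish vma with release order. */
static void toy_publish(struct toy_alloc *alloc, struct toy_vma *vma)
{
    assert(vma != NULL);
    alloc->vma_vm_mm = vma->vm_mm;
    atomic_store_explicit(&alloc->vma, vma, memory_order_release);
}

/*
 * Reader: the acquire load pairs with the release store, so a non-NULL
 * vma guarantees the earlier write to vma_vm_mm is visible as well.
 */
static struct toy_mm *toy_lookup_mm(struct toy_alloc *alloc)
{
    if (atomic_load_explicit(&alloc->vma, memory_order_acquire) == NULL)
        return NULL;
    return alloc->vma_vm_mm;
}
```

The interleaving in the report corresponds to publishing `vma` before `vma_vm_mm` is written; reversing the two stores and ordering them is what removes the window.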

(I didn't add a stable mark intentionally because standard android
userspace libraries that interact with binder (libbinder & libhwbinder)
prevent the mmap/ioctl race. - from Todd)

"
Thread interleaving:
CPU0 (binder_alloc_mmap_handler)              CPU1 (binder_alloc_new_buf_locked)
=====                                         =====
// drivers/android/binder_alloc.c
// #L718 (v4.18-rc3)
alloc->vma = vma;
                                              // drivers/android/binder_alloc.c
                                              // #L346 (v4.18-rc3)
                                              if (alloc->vma == NULL) {
                                                  ...
                                                  // alloc->vma is not NULL at this point
                                                  return ERR_PTR(-ESRCH);
                                              }
                                              ...
                                              // #L438
                                              binder_update_page_range(alloc, 0,
                                                      (void *)PAGE_ALIGN((uintptr_t)buffer->data),
                                                      end_page_addr);

                                              // In binder_update_page_range() #L218
                                              // But still alloc->vma_vm_mm is NULL here
                                              if (need_mm && mmget_not_zero(alloc->vma_vm_mm))
alloc->vma_vm_mm = vma->vm_mm;

Crash Log:
==================================================================
BUG: KASAN: null-ptr-deref in __atomic_add_unless include/asm-generic/atomic-instrumented.h:89 [inline]
BUG: KASAN: null-ptr-deref in atomic_add_unless include/linux/atomic.h:533 [inline]
BUG: KASAN: null-ptr-deref in mmget_not_zero include/linux/sched/mm.h:75 [inline]
BUG: KASAN: null-ptr-deref in binder_update_page_range+0xece/0x18e0 drivers/android/binder_alloc.c:218
Write of size 4 at addr 0000000000000058 by task syz-executor0/11184

CPU: 1 PID: 11184 Comm: syz-executor0 Not tainted 4.18.0-rc3 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x16e/0x22c lib/dump_stack.c:113
 kasan_report_error mm/kasan/report.c:352 [inline]
 kasan_report+0x163/0x380 mm/kasan/report.c:412
 check_memory_region_inline mm/kasan/kasan.c:260 [inline]
 check_memory_region+0x140/0x1a0 mm/kasan/kasan.c:267
 kasan_check_write+0x14/0x20 mm/kasan/kasan.c:278
 __atomic_add_unless include/asm-generic/atomic-instrumented.h:89 [inline]
 atomic_add_unless include/linux/atomic.h:533 [inline]
 mmget_not_zero include/linux/sched/mm.h:75 [inline]
 binder_update_page_range+0xece/0x18e0 drivers/android/binder_alloc.c:218
 binder_alloc_new_buf_locked drivers/android/binder_alloc.c:443 [inline]
 binder_alloc_new_buf+0x467/0xc30 drivers/android/binder_alloc.c:513
 binder_transaction+0x125b/0x4fb0 drivers/android/binder.c:2957
 binder_thread_write+0xc08/0x2770 drivers/android/binder.c:3528
 binder_ioctl_write_read.isra.39+0x24f/0x8e0 drivers/android/binder.c:4456
 binder_ioctl+0xa86/0xf34 drivers/android/binder.c:4596
 vfs_ioctl fs/ioctl.c:46 [inline]
 do_vfs_ioctl+0x154/0xd40 fs/ioctl.c:686
 ksys_ioctl+0x94/0xb0 fs/ioctl.c:701
 __do_sys_ioctl fs/ioctl.c:708 [inline]
 __se_sys_ioctl fs/ioctl.c:706 [inline]
 __x64_sys_ioctl+0x43/0x50 fs/ioctl.c:706
 do_syscall_64+0x167/0x4b0 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
"

Signed-off-by: Todd Kjos <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
Reviewed-by: Martijn Coenen <[email protected]>
Cc: stable <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
mjmartineau pushed a commit that referenced this issue Sep 28, 2018
This reverts commit 12eeeb4.

The patch doesn't fix accessing memory through a null pointer in
skl_interrupt().

There are two problems: 1) skl_init_chip() is called twice, before
and after the dma buffer is allocated. The first call sets
bus->chip_init, which prevents the second from initializing
bus->corb.buf and rirb.buf from bus->rb.area. 2)
snd_hdac_bus_init_chip() enables interrupts before
snd_hdac_bus_init_cmd_io() initializes the dma buffers. There is a
small window in which skl_interrupt() can be called if the irq has
been acquired. If so, it crashes when using the null dma buffer
pointers.

The problems will be fixed in the following patches. The crash is
attached for future reference.

[   16.949148] general protection fault: 0000 [#1] PREEMPT SMP KASAN PTI
<snipped>
[   16.950903] Call Trace:
[   16.950906]  <IRQ>
[   16.950918]  skl_interrupt+0x19e/0x2d6 [snd_soc_skl]
[   16.950926]  ? dma_supported+0xb5/0xb5 [snd_soc_skl]
[   16.950933]  __handle_irq_event_percpu+0x27a/0x6c8
[   16.950937]  ? __irq_wake_thread+0x1d1/0x1d1
[   16.950942]  ? __do_softirq+0x57a/0x69e
[   16.950944]  handle_irq_event_percpu+0x95/0x1ba
[   16.950948]  ? _raw_spin_unlock+0x65/0xdc
[   16.950951]  ? __handle_irq_event_percpu+0x6c8/0x6c8
[   16.950953]  ? _raw_spin_unlock+0x65/0xdc
[   16.950957]  ? time_cpufreq_notifier+0x483/0x483
[   16.950959]  handle_irq_event+0x89/0x123
[   16.950962]  handle_fasteoi_irq+0x16f/0x425
[   16.950965]  handle_irq+0x1fe/0x28e
[   16.950969]  do_IRQ+0x6e/0x12e
[   16.950972]  common_interrupt+0x7a/0x7a
[   16.950974]  </IRQ>
<snipped>
[   16.951031] RIP: snd_hdac_bus_update_rirb+0x19b/0x4cf [snd_hda_core] RSP: ffff88015c807c08
[   16.951036] ---[ end trace 58bf9ece1775bc92 ]---

Fixes: 2eeeb4f4733b ("ASoC: Intel: Skylake: Acquire irq after RIRB allocation")
Signed-off-by: Yu Zhao <[email protected]>
Signed-off-by: Mark Brown <[email protected]>
mjmartineau pushed a commit that referenced this issue Sep 28, 2018
When a netvsc device is removed, it can call reschedule in RCU context.
This happens because canceling the subchannel setup work could (in theory)
cause a reschedule when manipulating the timer.

To reproduce, run a lockdep-enabled kernel and unbind
a network device from hv_netvsc (via sysfs).

[  160.682011] WARNING: suspicious RCU usage
[  160.707466] 4.19.0-rc3-uio+ #2 Not tainted
[  160.709937] -----------------------------
[  160.712352] ./include/linux/rcupdate.h:302 Illegal context switch in RCU read-side critical section!
[  160.723691]
[  160.723691] other info that might help us debug this:
[  160.723691]
[  160.730955]
[  160.730955] rcu_scheduler_active = 2, debug_locks = 1
[  160.762813] 5 locks held by rebind-eth.sh/1812:
[  160.766851]  #0: 000000008befa37a (sb_writers#6){.+.+}, at: vfs_write+0x184/0x1b0
[  160.773416]  #1: 00000000b097f236 (&of->mutex){+.+.}, at: kernfs_fop_write+0xe2/0x1a0
[  160.783766]  #2: 0000000041ee6889 (kn->count#3){++++}, at: kernfs_fop_write+0xeb/0x1a0
[  160.787465]  #3: 0000000056d92a74 (&dev->mutex){....}, at: device_release_driver_internal+0x39/0x250
[  160.816987]  #4: 0000000030f6031e (rcu_read_lock){....}, at: netvsc_remove+0x1e/0x250 [hv_netvsc]
[  160.828629]
[  160.828629] stack backtrace:
[  160.831966] CPU: 1 PID: 1812 Comm: rebind-eth.sh Not tainted 4.19.0-rc3-uio+ #2
[  160.832952] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v1.0 11/26/2012
[  160.832952] Call Trace:
[  160.832952]  dump_stack+0x85/0xcb
[  160.832952]  ___might_sleep+0x1a3/0x240
[  160.832952]  __flush_work+0x57/0x2e0
[  160.832952]  ? __mutex_lock+0x83/0x990
[  160.832952]  ? __kernfs_remove+0x24f/0x2e0
[  160.832952]  ? __kernfs_remove+0x1b2/0x2e0
[  160.832952]  ? mark_held_locks+0x50/0x80
[  160.832952]  ? get_work_pool+0x90/0x90
[  160.832952]  __cancel_work_timer+0x13c/0x1e0
[  160.832952]  ? netvsc_remove+0x1e/0x250 [hv_netvsc]
[  160.832952]  ? __lock_is_held+0x55/0x90
[  160.832952]  netvsc_remove+0x9a/0x250 [hv_netvsc]
[  160.832952]  vmbus_remove+0x26/0x30
[  160.832952]  device_release_driver_internal+0x18a/0x250
[  160.832952]  unbind_store+0xb4/0x180
[  160.832952]  kernfs_fop_write+0x113/0x1a0
[  160.832952]  __vfs_write+0x36/0x1a0
[  160.832952]  ? rcu_read_lock_sched_held+0x6b/0x80
[  160.832952]  ? rcu_sync_lockdep_assert+0x2e/0x60
[  160.832952]  ? __sb_start_write+0x141/0x1a0
[  160.832952]  ? vfs_write+0x184/0x1b0
[  160.832952]  vfs_write+0xbe/0x1b0
[  160.832952]  ksys_write+0x55/0xc0
[  160.832952]  do_syscall_64+0x60/0x1b0
[  160.832952]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  160.832952] RIP: 0033:0x7fe48f4c8154

Resolve this by getting RTNL earlier. This is safe because the subchannel
work queue does trylock on RTNL and will detect the race.

Fixes: 7b2ee50 ("hv_netvsc: common detach logic")
Signed-off-by: Stephen Hemminger <[email protected]>
Reviewed-by: Haiyang Zhang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
mjmartineau pushed a commit that referenced this issue Sep 28, 2018
The command 'xl vcpu-set 0 0', issued in dom0, will crash dom0:

BUG: unable to handle kernel NULL pointer dereference at 00000000000002d8
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 7 PID: 65 Comm: xenwatch Not tainted 4.19.0-rc2-1.ga9462db-default #1 openSUSE Tumbleweed (unreleased)
Hardware name: Intel Corporation S5520UR/S5520UR, BIOS S5500.86B.01.00.0050.050620101605 05/06/2010
RIP: e030:device_offline+0x9/0xb0
Code: 77 24 00 e9 ce fe ff ff 48 8b 13 e9 68 ff ff ff 48 8b 13 e9 29 ff ff ff 48 8b 13 e9 ea fe ff ff 90 66 66 66 66 90 41 54 55 53 <f6> 87 d8 02 00 00 01 0f 85 88 00 00 00 48 c7 c2 20 09 60 81 31 f6
RSP: e02b:ffffc90040f27e80 EFLAGS: 00010203
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff8801f3800000 RSI: ffffc90040f27e70 RDI: 0000000000000000
RBP: 0000000000000000 R08: ffffffff820e47b3 R09: 0000000000000000
R10: 0000000000007ff0 R11: 0000000000000000 R12: ffffffff822e6d30
R13: dead000000000200 R14: dead000000000100 R15: ffffffff8158b4e0
FS:  00007ffa595158c0(0000) GS:ffff8801f39c0000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000002d8 CR3: 00000001d9602000 CR4: 0000000000002660
Call Trace:
 handle_vcpu_hotplug_event+0xb5/0xc0
 xenwatch_thread+0x80/0x140
 ? wait_woken+0x80/0x80
 kthread+0x112/0x130
 ? kthread_create_worker_on_cpu+0x40/0x40
 ret_from_fork+0x3a/0x50

This happens because handle_vcpu_hotplug_event is called twice. In the
first iteration cpu_present is still true; in the second iteration
cpu_present is false, which causes get_cpu_device to return NULL.
In the case of cpu#0, cpu_online is apparently always true.

Fix this crash by checking if the cpu can be hotplugged, which is false
for a cpu that was just removed.

Also check if the cpu was actually offlined by device_remove, otherwise
leave the cpu_present state as it is.

Rearrange the code to do all work with device_hotplug_lock held.
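
The double-invocation problem described above can be modelled in user space. The following is a minimal sketch, not the actual Xen patch: every `demo_` name, the device table, and the return values are hypothetical and exist only to show why a hotpluggable check before the offline step makes the second call harmless.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NR_CPUS 4

/* Hypothetical, simplified stand-in for a CPU's struct device. */
struct demo_cpu_dev {
    bool hotpluggable;
    bool offline;
};

/* Registered CPU devices; an entry is NULL once that CPU has been removed. */
static struct demo_cpu_dev *demo_cpu_devices[DEMO_NR_CPUS];

/* Models get_cpu_device(): returns NULL for a CPU that is not present. */
static struct demo_cpu_dev *demo_get_cpu_device(unsigned int cpu)
{
    return cpu < DEMO_NR_CPUS ? demo_cpu_devices[cpu] : NULL;
}

/* Models the hotpluggable check: false for a CPU that was just removed. */
static bool demo_cpu_is_hotpluggable(unsigned int cpu)
{
    struct demo_cpu_dev *dev = demo_get_cpu_device(cpu);

    return dev != NULL && dev->hotpluggable;
}

/* Offline and remove a CPU; safe to call twice for the same CPU. */
int demo_offline_cpu(unsigned int cpu)
{
    struct demo_cpu_dev *dev;

    if (!demo_cpu_is_hotpluggable(cpu))
        return -1;                  /* already removed, or not removable */

    dev = demo_get_cpu_device(cpu);
    dev->offline = true;
    demo_cpu_devices[cpu] = NULL;   /* the CPU is no longer present */
    return 0;
}
```

In this model the first call for a CPU succeeds and the second returns early instead of dereferencing a NULL device pointer, which is the behaviour the commit message describes.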

Signed-off-by: Olaf Hering <[email protected]>
Reviewed-by: Juergen Gross <[email protected]>
Signed-off-by: Boris Ostrovsky <[email protected]>
mjmartineau pushed a commit that referenced this issue Sep 28, 2018
…inux-nfs

Pull NFS client bugfixes from Anna Schumaker:
 "These are a handful of fixes for problems that Trond found. Patch #1
  and #3 have the same name; a second issue was found after applying the
  first patch.

  Stable bugfixes:
   - v4.17+: Fix tracepoint Oops in initiate_file_draining()
   - v4.11+: Fix an infinite loop on I/O

  Other fixes:
   - Return errors if a waiting layoutget is killed
   - Don't open code clearing of delegation state"

* tag 'nfs-for-4.19-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFS: Don't open code clearing of delegation state
  NFSv4.1 fix infinite loop on I/O.
  NFSv4: Fix a tracepoint Oops in initiate_file_draining()
  pNFS: Ensure we return the error if someone kills a waiting layoutget
  NFSv4: Fix a tracepoint Oops in initiate_file_draining()
mjmartineau pushed a commit that referenced this issue Sep 28, 2018
Chen Yu reported a divide-by-zero error when accessing the 'size'
resctrl file when an MBA resource is enabled.

divide error: 0000 [#1] SMP PTI
CPU: 93 PID: 1929 Comm: cat Not tainted 4.19.0-rc2-debug-rdt+ #25
RIP: 0010:rdtgroup_cbm_to_size+0x7e/0xa0
Call Trace:
rdtgroup_size_show+0x11a/0x1d0
seq_read+0xd8/0x3b0

Quoting Chen Yu's report: This is because for MB resource, the
r->cache.cbm_len is zero, thus calculating size in rdtgroup_cbm_to_size()
will trigger the exception.

Fix this issue in the 'size' file by reporting the correct memory
bandwidth value, which is in MBps when the MBA software controller is
enabled and a percentage when it is disabled.
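
The failure mode and the shape of the fix can be illustrated with a small user-space sketch. This is not the kernel code: the struct, its fields, and the `demo_` functions are hypothetical simplifications, and the bandwidth value is assumed to be passed through unchanged for MB resources.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a resctrl resource. */
struct demo_resource {
    bool is_mba;             /* true for a memory bandwidth (MB) resource */
    unsigned int cbm_len;    /* capacity bitmask length; 0 for MB resources */
    unsigned int cache_size; /* total cache size in bytes (cache resources) */
};

/* Count the set bits in a capacity bitmask. */
static unsigned int demo_count_bits(uint32_t cbm)
{
    unsigned int n = 0;

    for (; cbm != 0; cbm >>= 1)
        n += cbm & 1u;
    return n;
}

/*
 * Value to display in the 'size' file.
 * Cache resources: bytes covered by the bitmask in ctrl_val.
 * MB resources: ctrl_val is returned as-is, so cbm_len is never used
 * as a divisor.
 */
unsigned int demo_size_value(const struct demo_resource *r, uint32_t ctrl_val)
{
    if (r->is_mba)
        return ctrl_val;
    if (r->cbm_len == 0)
        return 0;            /* defensive: never divide by zero */
    return r->cache_size / r->cbm_len * demo_count_bits(ctrl_val);
}
```

Branching on the resource type before the division is the key design choice here: the bitmask arithmetic only makes sense for cache resources, so MB resources take a separate path.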

Fixes: d9b48c8 ("x86/intel_rdt: Display resource groups' allocations in bytes")
Reported-by: Chen Yu <[email protected]>
Signed-off-by: Reinette Chatre <[email protected]>
Signed-off-by: Fenghua Yu <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Chen Yu <[email protected]>
Cc: "H Peter Anvin" <[email protected]>
Cc: "Tony Luck" <[email protected]>
Cc: "Xiaochen Shen" <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
matttbe pushed a commit that referenced this issue May 9, 2025
When CONFIG_PROVE_RCU_LIST is enabled, fprobe triggers the following
warning:

    WARNING: suspicious RCU usage
    kernel/trace/fprobe.c:457 RCU-list traversed in non-reader section!!

    other info that might help us debug this:
	#1: ffffffff863c4e08 (fprobe_mutex){+.+.}-{4:4}, at: fprobe_module_callback+0x7b/0x8c0

    Call Trace:
	fprobe_module_callback
	notifier_call_chain
	blocking_notifier_call_chain

This warning occurs because fprobe_remove_node_in_module() traverses an
RCU list using RCU primitives without holding an RCU read lock. However,
the function is only called from fprobe_module_callback(), which holds
the fprobe_mutex lock that provides sufficient protection for safely
traversing the list.

Fix the warning by specifying the locking design to the
CONFIG_PROVE_RCU_LIST mechanism. Add the lockdep_is_held() argument to
hlist_for_each_entry_rcu() to inform the RCU checker that fprobe_mutex
provides the required protection.
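
The idea of passing an extra lock condition to the list iterator can be modelled in user space. The sketch below is an assumption-laden illustration, not kernel code: the `demo_` globals stand in for the RCU reader state and for `lockdep_is_held()`, and the macro only mimics the optional condition argument of `hlist_for_each_entry_rcu()`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical singly linked list node. */
struct demo_node {
    int value;
    struct demo_node *next;
};

static bool demo_rcu_read_held; /* models being inside an RCU read section */
static bool demo_mutex_held;    /* models lockdep_is_held(&some_mutex) */
static int demo_warnings;       /* number of "suspicious usage" reports */

/* Report a problem unless a reader section or the extra condition holds. */
static void demo_check(bool extra_cond)
{
    if (!demo_rcu_read_held && !extra_cond)
        demo_warnings++;
}

/* Iterate over the list, validating the locking context once up front. */
#define demo_for_each_entry(pos, head, cond) \
    for (demo_check(cond), (pos) = (head); (pos) != NULL; (pos) = (pos)->next)

/* Sum the list; pass_mutex_cond selects whether the mutex is declared. */
int demo_sum(struct demo_node *head, bool pass_mutex_cond)
{
    struct demo_node *pos;
    int sum = 0;

    demo_for_each_entry(pos, head, pass_mutex_cond && demo_mutex_held)
        sum += pos->value;
    return sum;
}
```

With `demo_mutex_held` set, a traversal that does not declare the mutex bumps `demo_warnings`, while one that declares it stays silent. The traversal itself is identical in both cases; only the information given to the checker changes.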

Fixes: a3dc298 ("tracing: fprobe: Cleanup fprobe hash when module unloading")
Signed-off-by: Breno Leitao <[email protected]>
matttbe pushed a commit that referenced this issue May 14, 2025
MACsec offload is not supported in switchdev mode for uplink
representors. When switching to the uplink representor profile, the
MACsec offload feature must be cleared from the netdevice's features.

If left enabled, attempts to add offloads result in a null pointer
dereference, as the uplink representor does not support MACsec offload
even though the feature bit remains set.

Clear NETIF_F_HW_MACSEC in mlx5e_fix_uplink_rep_features().
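
Clearing a feature bit amounts to a single mask operation. The sketch below shows the pattern in user-space C; the bit positions and the `demo_` names are hypothetical and chosen only for illustration, since the real NETIF_F_* values are defined by the kernel.

```c
#include <stdint.h>

/* Hypothetical feature bits, for illustration only. */
#define DEMO_F_HW_CSUM   (UINT64_C(1) << 0)
#define DEMO_F_HW_MACSEC (UINT64_C(1) << 1)

/*
 * Models a "fix features" hook for an uplink representor: whatever the
 * caller requested, the MACsec offload bit is masked out.
 */
uint64_t demo_fix_uplink_rep_features(uint64_t features)
{
    features &= ~DEMO_F_HW_MACSEC;
    return features;
}
```

Because the bit is removed from the advertised feature set, later code that keys off that bit should never reach the offload path for this device profile.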

Kernel log:

Oops: general protection fault, probably for non-canonical address 0xdffffc000000000f: 0000 [#1] SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000078-0x000000000000007f]
CPU: 29 UID: 0 PID: 4714 Comm: ip Not tainted 6.14.0-rc4_for_upstream_debug_2025_03_02_17_35 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:__mutex_lock+0x128/0x1dd0
Code: d0 7c 08 84 d2 0f 85 ad 15 00 00 8b 35 91 5c fe 03 85 f6 75 29 49 8d 7e 60 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 a6 15 00 00 4d 3b 76 60 0f 85 fd 0b 00 00 65 ff
RSP: 0018:ffff888147a4f160 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000001
RDX: 000000000000000f RSI: 0000000000000000 RDI: 0000000000000078
RBP: ffff888147a4f2e0 R08: ffffffffa05d2c19 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
R13: dffffc0000000000 R14: 0000000000000018 R15: ffff888152de0000
FS:  00007f855e27d800(0000) GS:ffff88881ee80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004e5768 CR3: 000000013ae7c005 CR4: 0000000000372eb0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ? die_addr+0x3d/0xa0
 ? exc_general_protection+0x144/0x220
 ? asm_exc_general_protection+0x22/0x30
 ? mlx5e_macsec_add_secy+0xf9/0x700 [mlx5_core]
 ? __mutex_lock+0x128/0x1dd0
 ? lockdep_set_lock_cmp_fn+0x190/0x190
 ? mlx5e_macsec_add_secy+0xf9/0x700 [mlx5_core]
 ? mutex_lock_io_nested+0x1ae0/0x1ae0
 ? lock_acquire+0x1c2/0x530
 ? macsec_upd_offload+0x145/0x380
 ? lockdep_hardirqs_on_prepare+0x400/0x400
 ? kasan_save_stack+0x30/0x40
 ? kasan_save_stack+0x20/0x40
 ? kasan_save_track+0x10/0x30
 ? __kasan_kmalloc+0x77/0x90
 ? __kmalloc_noprof+0x249/0x6b0
 ? genl_family_rcv_msg_attrs_parse.constprop.0+0xb5/0x240
 ? mlx5e_macsec_add_secy+0xf9/0x700 [mlx5_core]
 mlx5e_macsec_add_secy+0xf9/0x700 [mlx5_core]
 ? mlx5e_macsec_add_rxsa+0x11a0/0x11a0 [mlx5_core]
 macsec_update_offload+0x26c/0x820
 ? macsec_set_mac_address+0x4b0/0x4b0
 ? lockdep_hardirqs_on_prepare+0x284/0x400
 ? _raw_spin_unlock_irqrestore+0x47/0x50
 macsec_upd_offload+0x2c8/0x380
 ? macsec_update_offload+0x820/0x820
 ? __nla_parse+0x22/0x30
 ? genl_family_rcv_msg_attrs_parse.constprop.0+0x15e/0x240
 genl_family_rcv_msg_doit+0x1cc/0x2a0
 ? genl_family_rcv_msg_attrs_parse.constprop.0+0x240/0x240
 ? cap_capable+0xd4/0x330
 genl_rcv_msg+0x3ea/0x670
 ? genl_family_rcv_msg_dumpit+0x2a0/0x2a0
 ? lockdep_set_lock_cmp_fn+0x190/0x190
 ? macsec_update_offload+0x820/0x820
 netlink_rcv_skb+0x12b/0x390
 ? genl_family_rcv_msg_dumpit+0x2a0/0x2a0
 ? netlink_ack+0xd80/0xd80
 ? rwsem_down_read_slowpath+0xf90/0xf90
 ? netlink_deliver_tap+0xcd/0xac0
 ? netlink_deliver_tap+0x155/0xac0
 ? _copy_from_iter+0x1bb/0x12c0
 genl_rcv+0x24/0x40
 netlink_unicast+0x440/0x700
 ? netlink_attachskb+0x760/0x760
 ? lock_acquire+0x1c2/0x530
 ? __might_fault+0xbb/0x170
 netlink_sendmsg+0x749/0xc10
 ? netlink_unicast+0x700/0x700
 ? __might_fault+0xbb/0x170
 ? netlink_unicast+0x700/0x700
 __sock_sendmsg+0xc5/0x190
 ____sys_sendmsg+0x53f/0x760
 ? import_iovec+0x7/0x10
 ? kernel_sendmsg+0x30/0x30
 ? __copy_msghdr+0x3c0/0x3c0
 ? filter_irq_stacks+0x90/0x90
 ? stack_depot_save_flags+0x28/0xa30
 ___sys_sendmsg+0xeb/0x170
 ? kasan_save_stack+0x30/0x40
 ? copy_msghdr_from_user+0x110/0x110
 ? do_syscall_64+0x6d/0x140
 ? lock_acquire+0x1c2/0x530
 ? __virt_addr_valid+0x116/0x3b0
 ? __virt_addr_valid+0x1da/0x3b0
 ? lock_downgrade+0x680/0x680
 ? __delete_object+0x21/0x50
 __sys_sendmsg+0xf7/0x180
 ? __sys_sendmsg_sock+0x20/0x20
 ? kmem_cache_free+0x14c/0x4e0
 ? __x64_sys_close+0x78/0xd0
 do_syscall_64+0x6d/0x140
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7f855e113367
Code: 0e 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
RSP: 002b:00007ffd15e90c88 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f855e113367
RDX: 0000000000000000 RSI: 00007ffd15e90cf0 RDI: 0000000000000004
RBP: 00007ffd15e90dbc R08: 0000000000000028 R09: 000000000045d100
R10: 00007f855e011dd8 R11: 0000000000000246 R12: 0000000000000019
R13: 0000000067c6b785 R14: 00000000004a1e80 R15: 0000000000000000
 </TASK>
Modules linked in: 8021q garp mrp sch_ingress openvswitch nsh mlx5_ib mlx5_fwctl mlx5_dpll mlx5_core rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm ib_uverbs ib_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay zram zsmalloc fuse [last unloaded: mlx5_core]
---[ end trace 0000000000000000 ]---

Fixes: 8ff0ac5 ("net/mlx5: Add MACsec offload Tx command support")
Signed-off-by: Carolina Jubran <[email protected]>
Reviewed-by: Shahar Shitrit <[email protected]>
Reviewed-by: Dragos Tatulea <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
matttbe pushed a commit that referenced this issue May 16, 2025
Calling core::fmt::write() from rust code while FineIBT is enabled
results in a kernel panic:

[ 4614.199779] kernel BUG at arch/x86/kernel/cet.c:132!
[ 4614.205343] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 4614.211781] CPU: 2 UID: 0 PID: 6057 Comm: dmabuf_dump Tainted: G     U     O       6.12.17-android16-0-g6ab38c534a43 #1 9da040f27673ec3945e23b998a0f8bd64c846599
[ 4614.227832] Tainted: [U]=USER, [O]=OOT_MODULE
[ 4614.241247] RIP: 0010:do_kernel_cp_fault+0xea/0xf0
...
[ 4614.398144] RIP: 0010:_RNvXs5_NtNtNtCs3o2tGsuHyou_4core3fmt3num3impyNtB9_7Display3fmt+0x0/0x20
[ 4614.407792] Code: 48 f7 df 48 0f 48 f9 48 89 f2 89 c6 5d e9 18 fd ff ff 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 41 81 ea 14 61 af 2c 74 03 0f 0b 90 <66> 0f 1f 00 55 48 89 e5 48 89 f2 48 8b 3f be 01 00 00 00 5d e9 e7
[ 4614.428775] RSP: 0018:ffffb95acfa4ba68 EFLAGS: 00010246
[ 4614.434609] RAX: 0000000000000000 RBX: 0000000000000010 RCX: 0000000000000000
[ 4614.442587] RDX: 0000000000000007 RSI: ffffb95acfa4ba70 RDI: ffffb95acfa4bc88
[ 4614.450557] RBP: ffffb95acfa4bae0 R08: ffff0a00ffffff05 R09: 0000000000000070
[ 4614.458527] R10: 0000000000000000 R11: ffffffffab67eaf0 R12: ffffb95acfa4bcc8
[ 4614.466493] R13: ffffffffac5d50f0 R14: 0000000000000000 R15: 0000000000000000
[ 4614.474473]  ? __cfi__RNvXs5_NtNtNtCs3o2tGsuHyou_4core3fmt3num3impyNtB9_7Display3fmt+0x10/0x10
[ 4614.484118]  ? _RNvNtCs3o2tGsuHyou_4core3fmt5write+0x1d2/0x250

This happens because core::fmt::write() calls
core::fmt::rt::Argument::fmt(), which currently has CFI disabled:

library/core/src/fmt/rt.rs:
    // FIXME: Transmuting formatter in new and indirectly branching to/calling
    // it here is an explicit CFI violation.
    #[allow(inline_no_sanitize)]
    #[no_sanitize(cfi, kcfi)]
    #[inline]
    pub(super) unsafe fn fmt(&self, f: &mut Formatter<'_>) -> Result {

This causes a Control Protection exception, because FineIBT has sealed
off the original function's endbr64.

This makes rust currently incompatible with FineIBT. Add a Kconfig
dependency that prevents FineIBT from getting turned on by default
if rust is enabled.

[ Rust 1.88.0 (scheduled for 2025-06-26) should have this fixed [1],
  and thus we relaxed the condition with Rust >= 1.88.

  When `objtool` lands checking for this with e.g. [2], the plan is
  to ideally run that in upstream Rust's CI to prevent regressions
  early [3], since we do not control `core`'s source code.

  Alice tested the Rust PR backported to an older compiler.

  Peter would like that Rust provides a stable `core` which can be
  pulled into the kernel: "Relying on that much out of tree code is
  'unfortunate'".

    - Miguel ]

Signed-off-by: Paweł Anikiel <[email protected]>
Reviewed-by: Alice Ryhl <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Link: rust-lang/rust#139632 [1]
Link: https://lore.kernel.org/rust-for-linux/[email protected]/ [2]
Link: rust-lang/rust#139632 (comment) [3]
Link: https://lore.kernel.org/r/[email protected]
Link: https://lore.kernel.org/r/att0-CANiq72kjDM0cKALVy4POEzhfdT4nO7tqz0Pm7xM+3=_0+L1t=A@mail.gmail.com
[ Reduced splat. - Miguel ]
Signed-off-by: Miguel Ojeda <[email protected]>
matttbe pushed a commit that referenced this issue May 16, 2025
When userspace does PR_SET_TAGGED_ADDR_CTRL, but Supm extension is not
available, the kernel crashes:

Oops - illegal instruction [#1]
    [snip]
epc : set_tagged_addr_ctrl+0x112/0x15a
 ra : set_tagged_addr_ctrl+0x74/0x15a
epc : ffffffff80011ace ra : ffffffff80011a30 sp : ffffffc60039be10
    [snip]
status: 0000000200000120 badaddr: 0000000010a79073 cause: 0000000000000002
    set_tagged_addr_ctrl+0x112/0x15a
    __riscv_sys_prctl+0x352/0x73c
    do_trap_ecall_u+0x17c/0x20c
    handle_exception+0x150/0x15c

Fix it by checking if Supm is available.

Fixes: 09d6775 ("riscv: Add support for userspace pointer masking")
Signed-off-by: Nam Cao <[email protected]>
Cc: [email protected]
Reviewed-by: Samuel Holland <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexandre Ghiti <[email protected]>
matttbe pushed a commit that referenced this issue May 16, 2025
When CONFIG_PROVE_RCU_LIST is enabled, fprobe triggers the following
warning:

    WARNING: suspicious RCU usage
    kernel/trace/fprobe.c:457 RCU-list traversed in non-reader section!!

    other info that might help us debug this:
	#1: ffffffff863c4e08 (fprobe_mutex){+.+.}-{4:4}, at: fprobe_module_callback+0x7b/0x8c0

    Call Trace:
	fprobe_module_callback
	notifier_call_chain
	blocking_notifier_call_chain

This warning occurs because fprobe_remove_node_in_module() traverses an
RCU list using RCU primitives without holding an RCU read lock. However,
the function is only called from fprobe_module_callback(), which holds
the fprobe_mutex lock that provides sufficient protection for safely
traversing the list.

Fix the warning by specifying the locking design to the
CONFIG_PROVE_RCU_LIST mechanism. Add the lockdep_is_held() argument to
hlist_for_each_entry_rcu() to inform the RCU checker that fprobe_mutex
provides the required protection.
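
The effect of the extra condition argument can be pictured with a small
userspace model. This is only a sketch under stated assumptions, not kernel
code: in_rcu_reader, fprobe_mutex_held, list_check() and sum_list() are
hypothetical stand-ins for the RCU reader state, the result of
lockdep_is_held(&fprobe_mutex), and the check performed when
CONFIG_PROVE_RCU_LIST is enabled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel state; none of these names exist in the kernel. */
static bool in_rcu_reader;      /* models "inside an RCU read-side section" */
static bool fprobe_mutex_held;  /* models lockdep_is_held(&fprobe_mutex) */
static int warnings;            /* counts "suspicious RCU usage" reports */

struct node {
	struct node *next;
	int value;
};

/* Models the checker: warn only if neither form of protection is in place. */
static void list_check(bool cond)
{
	if (!cond && !in_rcu_reader)
		warnings++;
}

/* Walk the list; 'cond' plays the role of the extra lockdep argument. */
static int sum_list(const struct node *head, bool cond)
{
	int sum = 0;

	list_check(cond);
	for (const struct node *n = head; n; n = n->next)
		sum += n->value;
	return sum;
}
```

In this model, walking the list with cond == false while only the mutex is
held increments warnings, whereas passing fprobe_mutex_held as the condition
keeps the checker quiet without entering an RCU read-side section.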

Link: https://lore.kernel.org/all/[email protected]/

Fixes: a3dc298 ("tracing: fprobe: Cleanup fprobe hash when module unloading")
Signed-off-by: Breno Leitao <[email protected]>
Tested-by: Antonio Quartulli <[email protected]>
Tested-by: Matthieu Baerts (NGI0) <[email protected]>
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
matttbe pushed a commit that referenced this issue May 16, 2025
 into HEAD

KVM/riscv fixes for 6.15, take #1

- Add missing reset of smstateen CSRs
matttbe pushed a commit that referenced this issue May 16, 2025
If the discard worker is running and there's currently only one block
group, that block group is a data block group, it's in the unused block
groups discard list and is being used (it got an extent allocated from it
after becoming unused), the worker can end up in an infinite loop if a
transaction abort happens or the async discard is disabled (during remount
or unmount for example).

This happens like this:

1) Task A, the discard worker, is at peek_discard_list() and
   find_next_block_group() returns block group X;

2) Block group X is in the unused block groups discard list (its discard
   index is BTRFS_DISCARD_INDEX_UNUSED) since at some point in the past
   it became an unused block group and was added to that list, but then
   later it got an extent allocated from it, so its ->used counter is not
   zero anymore;

3) The current transaction is aborted by task B and we end up at
   __btrfs_handle_fs_error() in the transaction abort path, where we call
   btrfs_discard_stop(), which clears BTRFS_FS_DISCARD_RUNNING from
   fs_info, and then at __btrfs_handle_fs_error() we set the fs to RO mode
   (setting SB_RDONLY in the super block's s_flags field);

4) Task A calls __add_to_discard_list() with the goal of moving the block
   group from the unused block groups discard list into another discard
   list, but at __add_to_discard_list() we end up doing nothing because
   btrfs_run_discard_work() returns false, since the super block has
   SB_RDONLY set in its flags and BTRFS_FS_DISCARD_RUNNING is not set
   anymore in fs_info->flags. So block group X remains in the unused block
   groups discard list;

5) Task A then does a goto into the 'again' label, calls
   find_next_block_group() again and gets block group X again. Then it
   repeats the previous steps over and over since there are no other
   block groups in the discard lists and block group X is never moved
   out of the unused block groups discard list since
   btrfs_run_discard_work() keeps returning false and therefore
   __add_to_discard_list() doesn't move block group X out of that discard
   list.

When this happens we can get a soft lockup report like this:

  [71.957] watchdog: BUG: soft lockup - CPU#0 stuck for 27s! [kworker/u4:3:97]
  [71.957] Modules linked in: xfs af_packet rfkill (...)
  [71.957] CPU: 0 UID: 0 PID: 97 Comm: kworker/u4:3 Tainted: G        W          6.14.2-1-default #1 openSUSE Tumbleweed 968795ef2b1407352128b466fe887416c33af6fa
  [71.957] Tainted: [W]=WARN
  [71.957] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
  [71.957] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
  [71.957] RIP: 0010:btrfs_discard_workfn+0xc4/0x400 [btrfs]
  [71.957] Code: c1 01 48 83 (...)
  [71.957] RSP: 0018:ffffafaec03efe08 EFLAGS: 00000246
  [71.957] RAX: ffff897045500000 RBX: ffff8970413ed8d0 RCX: 0000000000000000
  [71.957] RDX: 0000000000000001 RSI: ffff8970413ed8d0 RDI: 0000000a8f1272ad
  [71.957] RBP: 0000000a9d61c60e R08: ffff897045500140 R09: 8080808080808080
  [71.957] R10: ffff897040276800 R11: fefefefefefefeff R12: ffff8970413ed860
  [71.957] R13: ffff897045500000 R14: ffff8970413ed868 R15: 0000000000000000
  [71.957] FS:  0000000000000000(0000) GS:ffff89707bc00000(0000) knlGS:0000000000000000
  [71.957] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [71.957] CR2: 00005605bcc8d2f0 CR3: 000000010376a001 CR4: 0000000000770ef0
  [71.957] PKRU: 55555554
  [71.957] Call Trace:
  [71.957]  <TASK>
  [71.957]  process_one_work+0x17e/0x330
  [71.957]  worker_thread+0x2ce/0x3f0
  [71.957]  ? __pfx_worker_thread+0x10/0x10
  [71.957]  kthread+0xef/0x220
  [71.957]  ? __pfx_kthread+0x10/0x10
  [71.957]  ret_from_fork+0x34/0x50
  [71.957]  ? __pfx_kthread+0x10/0x10
  [71.957]  ret_from_fork_asm+0x1a/0x30
  [71.957]  </TASK>
  [71.957] Kernel panic - not syncing: softlockup: hung tasks
  [71.987] CPU: 0 UID: 0 PID: 97 Comm: kworker/u4:3 Tainted: G        W    L     6.14.2-1-default #1 openSUSE Tumbleweed 968795ef2b1407352128b466fe887416c33af6fa
  [71.989] Tainted: [W]=WARN, [L]=SOFTLOCKUP
  [71.989] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
  [71.991] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
  [71.992] Call Trace:
  [71.993]  <IRQ>
  [71.994]  dump_stack_lvl+0x5a/0x80
  [71.994]  panic+0x10b/0x2da
  [71.995]  watchdog_timer_fn.cold+0x9a/0xa1
  [71.996]  ? __pfx_watchdog_timer_fn+0x10/0x10
  [71.997]  __hrtimer_run_queues+0x132/0x2a0
  [71.997]  hrtimer_interrupt+0xff/0x230
  [71.998]  __sysvec_apic_timer_interrupt+0x55/0x100
  [71.999]  sysvec_apic_timer_interrupt+0x6c/0x90
  [72.000]  </IRQ>
  [72.000]  <TASK>
  [72.001]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
  [72.002] RIP: 0010:btrfs_discard_workfn+0xc4/0x400 [btrfs]
  [72.002] Code: c1 01 48 83 (...)
  [72.005] RSP: 0018:ffffafaec03efe08 EFLAGS: 00000246
  [72.006] RAX: ffff897045500000 RBX: ffff8970413ed8d0 RCX: 0000000000000000
  [72.006] RDX: 0000000000000001 RSI: ffff8970413ed8d0 RDI: 0000000a8f1272ad
  [72.007] RBP: 0000000a9d61c60e R08: ffff897045500140 R09: 8080808080808080
  [72.008] R10: ffff897040276800 R11: fefefefefefefeff R12: ffff8970413ed860
  [72.009] R13: ffff897045500000 R14: ffff8970413ed868 R15: 0000000000000000
  [72.010]  ? btrfs_discard_workfn+0x51/0x400 [btrfs 23b01089228eb964071fb7ca156eee8cd3bf996f]
  [72.011]  process_one_work+0x17e/0x330
  [72.012]  worker_thread+0x2ce/0x3f0
  [72.013]  ? __pfx_worker_thread+0x10/0x10
  [72.014]  kthread+0xef/0x220
  [72.014]  ? __pfx_kthread+0x10/0x10
  [72.015]  ret_from_fork+0x34/0x50
  [72.015]  ? __pfx_kthread+0x10/0x10
  [72.016]  ret_from_fork_asm+0x1a/0x30
  [72.017]  </TASK>
  [72.017] Kernel Offset: 0x15000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
  [72.019] Rebooting in 90 seconds..

So fix this by making sure we move a block group out of the unused block
groups discard list when calling __add_to_discard_list().
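
The loop described in steps 1) to 5) can be reproduced with a small userspace
model. This is a sketch and not the actual patch: the enum, the struct and all
function names below are hypothetical simplifications, and run_discard_work
stands in for the return value of btrfs_run_discard_work().

```c
#include <stdbool.h>

/* Hypothetical, simplified model of the discard lists. */
enum discard_index { INDEX_UNUSED = 0, INDEX_OTHER = 1 };

struct block_group {
	enum discard_index index;
	unsigned long used;
};

/* Models btrfs_run_discard_work(): false once the fs is RO or discard is stopped. */
static bool run_discard_work;

/* Old behaviour: bail out early, leaving the block group in the unused list. */
static void add_to_discard_list_old(struct block_group *bg)
{
	if (!run_discard_work)
		return;
	if (bg->index == INDEX_UNUSED)
		bg->index = INDEX_OTHER;
}

/* Fixed behaviour in this model: always leave the unused list first. */
static void add_to_discard_list_fixed(struct block_group *bg)
{
	if (bg->index == INDEX_UNUSED)
		bg->index = INDEX_OTHER;
	if (!run_discard_work)
		return;
	/* queueing on the new list is omitted in this model */
}

/*
 * Models the 'again' loop of the worker. Returns the number of iterations
 * needed, or -1 if max_iter was reached (the soft lockup case).
 */
static int peek_iterations(struct block_group *bg,
			   void (*add)(struct block_group *), int max_iter)
{
	for (int i = 1; i <= max_iter; i++) {
		if (bg->index == INDEX_UNUSED && bg->used != 0) {
			add(bg);
			continue;	/* goto again */
		}
		return i;
	}
	return -1;
}
```

With run_discard_work set to false, the old helper never changes the index, so
the loop only ends when the iteration cap is hit; the fixed helper lets the
loop finish on its second pass.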

Fixes: 2bee7eb ("btrfs: discard one region at a time in async discard")
Link: https://bugzilla.suse.com/show_bug.cgi?id=1242012
CC: [email protected] # 5.10+
Reviewed-by: Boris Burkov <[email protected]>
Reviewed-by: Daniel Vacek <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
matttbe pushed a commit that referenced this issue May 16, 2025
…unload

Kernel panic occurs when a devmem TCP socket is closed after NIC module
is unloaded.

These are the devmem TCP unregistration scenarios; each number gives the
order in which the step happens.
(a)netlink socket close    (b)pp destroy    (c)uninstall    result
1                          2                3               OK
1                          3                2               (d)Impossible
2                          1                3               OK
3                          1                2               (e)Kernel panic
2                          3                1               (d)Impossible
3                          2                1               (d)Impossible

(a) netdev_nl_sock_priv_destroy() is called when devmem TCP socket is
    closed.
(b) page_pool_destroy() is called when the interface is down.
(c) mp_ops->uninstall() is called when an interface is unregistered.
(d) There is no scenario in which mp_ops->uninstall() is called before
    page_pool_destroy(), because unregister_netdevice_many_notify()
    closes interfaces first and then calls mp_ops->uninstall().
(e) netdev_nl_sock_priv_destroy() accesses struct net_device to acquire
    netdev_lock().
    But if the interface module has already been removed, net_device
    pointer is invalid, so it causes kernel panic.

In summary, there are only 3 possible scenarios.
 A. sk close -> pp destroy -> uninstall.
 B. pp destroy -> sk close -> uninstall.
 C. pp destroy -> uninstall -> sk close.

Case C is a kernel panic scenario.
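
The table can be restated as two ordering rules. The sketch below is a
hypothetical userspace helper (classify() and the enum are not kernel names)
that takes the position of each event and returns the same result column as
the table.

```c
/* Hypothetical helper that mirrors the result column of the table. */
enum result { RES_OK, RES_IMPOSSIBLE, RES_PANIC };

/* Each argument is the position (1..3) at which that event happens. */
static enum result classify(int sk_close, int pp_destroy, int uninstall)
{
	/* (d): uninstall never runs before page pool destroy. */
	if (uninstall < pp_destroy)
		return RES_IMPOSSIBLE;
	/* (e): closing the socket after uninstall touches a stale net_device. */
	if (uninstall < sk_close)
		return RES_PANIC;
	return RES_OK;
}
```

Only the ordering 3, 1, 2 (socket close last, after uninstall) is both
possible and unsafe, which is case C.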

To fix this problem, make mp_dmabuf_devmem_uninstall() set binding->dev
to NULL, which indicates that the bound net_device was unregistered.

Make netdev_nl_sock_priv_destroy() skip acquiring netdev_lock() if
binding->dev is NULL.

A new binding->lock is added to protect the dev field of a binding, so
the lock ordering is as follows:
 priv->lock
 netdev_lock(dev)
 binding->lock
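
The NULL check can be sketched in userspace as follows. This is a model under
stated assumptions rather than the patch itself: the *_model types and
functions are hypothetical, and the three locks listed above are reduced to a
counter so that the example stays single-threaded.

```c
#include <stddef.h>

/* Hypothetical model types; the real structures carry much more state. */
struct net_device_model {
	int lock_count;		/* >0 while the modelled netdev lock is held */
};

struct binding_model {
	struct net_device_model *dev;	/* NULL once the device is unregistered */
};

/* Models the uninstall path: mark the binding as having no device. */
static void uninstall_model(struct binding_model *b)
{
	b->dev = NULL;
}

/*
 * Models the socket destroy path. Returns 1 if the device lock was taken,
 * 0 if it was skipped because the device is already gone.
 */
static int sock_destroy_model(struct binding_model *b)
{
	struct net_device_model *dev = b->dev;	/* kernel: read under binding->lock */

	if (!dev)
		return 0;	/* never touch the stale net_device */

	dev->lock_count++;	/* stands in for taking the netdev lock */
	/* unbinding work would happen here */
	dev->lock_count--;	/* stands in for releasing it */
	return 1;
}
```

In case C, uninstall_model() runs first, so the later socket close sees
dev == NULL and returns without dereferencing the removed device.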

Tests:
Scenario A:
    ./ncdevmem -s 192.168.1.4 -c 192.168.1.2 -f $interface -l -p 8000 \
        -v 7 -t 1 -q 1 &
    pid=$!
    sleep 10
    kill $pid
    ip link set $interface down
    modprobe -rv $module

Scenario B:
    ./ncdevmem -s 192.168.1.4 -c 192.168.1.2 -f $interface -l -p 8000 \
        -v 7 -t 1 -q 1 &
    pid=$!
    sleep 10
    ip link set $interface down
    kill $pid
    modprobe -rv $module

Scenario C:
    ./ncdevmem -s 192.168.1.4 -c 192.168.1.2 -f $interface -l -p 8000 \
        -v 7 -t 1 -q 1 &
    pid=$!
    sleep 10
    modprobe -rv $module
    sleep 5
    kill $pid

Splat looks like:
Oops: general protection fault, probably for non-canonical address 0xdffffc001fffa9f7: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
KASAN: probably user-memory-access in range [0x00000000fffd4fb8-0x00000000fffd4fbf]
CPU: 0 UID: 0 PID: 2041 Comm: ncdevmem Tainted: G    B   W           6.15.0-rc1+ #2 PREEMPT(undef)  0947ec89efa0fd68838b78e36aa1617e97ff5d7f
Tainted: [B]=BAD_PAGE, [W]=WARN
RIP: 0010:__mutex_lock (./include/linux/sched.h:2244 kernel/locking/mutex.c:400 kernel/locking/mutex.c:443 kernel/locking/mutex.c:605 kernel/locking/mutex.c:746)
Code: ea 03 80 3c 02 00 0f 85 4f 13 00 00 49 8b 1e 48 83 e3 f8 74 6a 48 b8 00 00 00 00 00 fc ff df 48 8d 7b 34 48 89 fa 48 c1 ea 03 <0f> b6 f
RSP: 0018:ffff88826f7ef730 EFLAGS: 00010203
RAX: dffffc0000000000 RBX: 00000000fffd4f88 RCX: ffffffffaa9bc811
RDX: 000000001fffa9f7 RSI: 0000000000000008 RDI: 00000000fffd4fbc
RBP: ffff88826f7ef8b0 R08: 0000000000000000 R09: ffffed103e6aa1a4
R10: 0000000000000007 R11: ffff88826f7ef442 R12: fffffbfff669f65e
R13: ffff88812a830040 R14: ffff8881f3550d20 R15: 00000000fffd4f88
FS:  0000000000000000(0000) GS:ffff888866c05000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000563bed0cb288 CR3: 00000001a7c98000 CR4: 00000000007506f0
PKRU: 55555554
Call Trace:
<TASK>
 ...
 netdev_nl_sock_priv_destroy (net/core/netdev-genl.c:953 (discriminator 3))
 genl_release (net/netlink/genetlink.c:653 net/netlink/genetlink.c:694 net/netlink/genetlink.c:705)
 ...
 netlink_release (net/netlink/af_netlink.c:737)
 ...
 __sock_release (net/socket.c:647)
 sock_close (net/socket.c:1393)

Fixes: 1d22d30 ("net: drop rtnl_lock for queue_mgmt operations")
Signed-off-by: Taehee Yoo <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
matttbe pushed a commit that referenced this issue May 19, 2025
Guoyu Yin reported a splat in the ipmr netns cleanup path:

WARNING: CPU: 2 PID: 14564 at net/ipv4/ipmr.c:440 ipmr_free_table net/ipv4/ipmr.c:440 [inline]
WARNING: CPU: 2 PID: 14564 at net/ipv4/ipmr.c:440 ipmr_rules_exit+0x135/0x1c0 net/ipv4/ipmr.c:361
Modules linked in:
CPU: 2 UID: 0 PID: 14564 Comm: syz.4.838 Not tainted 6.14.0 #1
Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:ipmr_free_table net/ipv4/ipmr.c:440 [inline]
RIP: 0010:ipmr_rules_exit+0x135/0x1c0 net/ipv4/ipmr.c:361
Code: ff df 48 c1 ea 03 80 3c 02 00 75 7d 48 c7 83 60 05 00 00 00 00 00 00 5b 5d 41 5c 41 5d 41 5e e9 71 67 7f 00 e8 4c 2d 8a fd 90 <0f> 0b 90 eb 93 e8 41 2d 8a fd 0f b6 2d 80 54 ea 01 31 ff 89 ee e8
RSP: 0018:ffff888109547c58 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff888108c12dc0 RCX: ffffffff83e09868
RDX: ffff8881022b3300 RSI: ffffffff83e098d4 RDI: 0000000000000005
RBP: ffff888104288000 R08: 0000000000000000 R09: ffffed10211825c9
R10: 0000000000000001 R11: ffff88801816c4a0 R12: 0000000000000001
R13: ffff888108c13320 R14: ffff888108c12dc0 R15: fffffbfff0b74058
FS:  00007f84f39316c0(0000) GS:ffff88811b100000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f84f3930f98 CR3: 0000000113b56000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
 ipmr_net_exit_batch+0x50/0x90 net/ipv4/ipmr.c:3160
 ops_exit_list+0x10c/0x160 net/core/net_namespace.c:177
 setup_net+0x47d/0x8e0 net/core/net_namespace.c:394
 copy_net_ns+0x25d/0x410 net/core/net_namespace.c:516
 create_new_namespaces+0x3f6/0xaf0 kernel/nsproxy.c:110
 unshare_nsproxy_namespaces+0xc3/0x180 kernel/nsproxy.c:228
 ksys_unshare+0x78d/0x9a0 kernel/fork.c:3342
 __do_sys_unshare kernel/fork.c:3413 [inline]
 __se_sys_unshare kernel/fork.c:3411 [inline]
 __x64_sys_unshare+0x31/0x40 kernel/fork.c:3411
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xa6/0x1a0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f84f532cc29
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f84f3931038 EFLAGS: 00000246 ORIG_RAX: 0000000000000110
RAX: ffffffffffffffda RBX: 00007f84f5615fa0 RCX: 00007f84f532cc29
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000040000400
RBP: 00007f84f53fba18 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 00007f84f5615fa0 R15: 00007fff51c5f328
 </TASK>

The running kernel has CONFIG_IP_MROUTE_MULTIPLE_TABLES disabled, and
the sanity check for such a build is still too loose.

Address the issue by consolidating the relevant sanity check in a single
helper regardless of the kernel configuration. Also share it between
the ipv4 and ipv6 code.

Reported-by: Guoyu Yin <[email protected]>
Fixes: 50b9420 ("ipmr: tune the ipmr_can_free_table() checks.")
Signed-off-by: Paolo Abeni <[email protected]>
Link: https://patch.msgid.link/372dc261e1bf12742276e1b984fc5a071b7fc5a8.1747321903.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <[email protected]>
matttbe pushed a commit that referenced this issue May 22, 2025
While tracking an IDPF bug, I found that idpf_vport_splitq_napi_poll()
was not following NAPI rules.

It can indeed return @budget after napi_complete() has been called.

Add two debug conditions in networking core to hopefully catch
this kind of bug sooner.

IDPF bug will be fixed in a separate patch.
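
The rule being violated can be sketched with a userspace model. Everything
below is hypothetical (napi_model, the two poll functions and
repoll_violation() are not kernel names), and it models only the condition
behind the "repoll requested ... but napi is not scheduled" message shown in
the log that follows; budget is assumed to be greater than zero.

```c
#include <stdbool.h>

/* Hypothetical model of a NAPI instance: only the scheduled bit is kept. */
struct napi_model {
	bool scheduled;
};

/* Models completion: the instance is no longer scheduled afterwards. */
static void napi_complete_model(struct napi_model *n)
{
	n->scheduled = false;
}

/* Breaks the rule: completes, then still reports the full budget. */
static int buggy_poll(struct napi_model *n, int budget)
{
	napi_complete_model(n);
	return budget;
}

/* Follows the rule: after completing, reports less work than the budget. */
static int good_poll(struct napi_model *n, int budget)
{
	(void)budget;
	napi_complete_model(n);
	return 0;
}

/* Returns true when a poll asks for a repoll although it already completed. */
static bool repoll_violation(struct napi_model *n,
			     int (*poll)(struct napi_model *, int), int budget)
{
	n->scheduled = true;
	int work = poll(n, budget);

	return work == budget && !n->scheduled;
}
```

Returning the full budget tells the caller to poll again, which is only
consistent if the instance is still scheduled; the model flags buggy_poll()
and accepts good_poll().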

[   72.441242] repoll requested for device eth1 idpf_vport_splitq_napi_poll [idpf] but napi is not scheduled.
[   72.446291] list_del corruption. next->prev should be ff31783d93b14040, but was ff31783d93b10080. (next=ff31783d93b10080)
[   72.446659] kernel BUG at lib/list_debug.c:67!
[   72.446816] Oops: invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC NOPTI
[   72.447031] CPU: 156 UID: 0 PID: 16258 Comm: ip Tainted: G        W           6.15.0-dbg-DEV #1944 NONE
[   72.447340] Tainted: [W]=WARN
[   72.447702] RIP: 0010:__list_del_entry_valid_or_report (lib/list_debug.c:65)
[   72.450630] Call Trace:
[   72.450720]  <IRQ>
[   72.450797] net_rx_action (include/linux/list.h:215 include/linux/list.h:287 net/core/dev.c:7385 net/core/dev.c:7516)
[   72.450928] ? lock_release (kernel/locking/lockdep.c:?)
[   72.451059] ? clockevents_program_event (kernel/time/clockevents.c:?)
[   72.451222] handle_softirqs (kernel/softirq.c:579)
[   72.451356] ? do_softirq (kernel/softirq.c:480)
[   72.451480] ? idpf_vc_xn_exec (drivers/net/ethernet/intel/idpf/idpf_virtchnl.c:462) idpf
[   72.451635] do_softirq (kernel/softirq.c:480)
[   72.451750]  </IRQ>
[   72.451828]  <TASK>
[   72.451905] __local_bh_enable_ip (kernel/softirq.c:?)
[   72.452051] idpf_vc_xn_exec (drivers/net/ethernet/intel/idpf/idpf_virtchnl.c:462) idpf
[   72.452210] idpf_send_delete_queues_msg (drivers/net/ethernet/intel/idpf/idpf_virtchnl.c:2083) idpf
[   72.452390] idpf_vport_stop (drivers/net/ethernet/intel/idpf/idpf_lib.c:837 drivers/net/ethernet/intel/idpf/idpf_lib.c:868) idpf
[   72.452541] ? idpf_vport_stop (include/linux/bottom_half.h:? include/linux/netdevice.h:4762 drivers/net/ethernet/intel/idpf/idpf_lib.c:855) idpf
[   72.452695] idpf_initiate_soft_reset (drivers/net/ethernet/intel/idpf/idpf_lib.c:?) idpf
[   72.452867] idpf_change_mtu (drivers/net/ethernet/intel/idpf/idpf_lib.c:2189) idpf
[   72.453015] netif_set_mtu_ext (net/core/dev.c:9437)
[   72.453157] ? packet_notifier (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 net/packet/af_packet.c:4240)
[   72.453292] netif_set_mtu (net/core/dev.c:9515)
[   72.453416] dev_set_mtu (net/core/dev_api.c:?)
[   72.453534] bond_change_mtu (drivers/net/bonding/bond_main.c:4833)
[   72.453666] netif_set_mtu_ext (net/core/dev.c:9437)
[   72.453803] do_setlink (net/core/rtnetlink.c:3116)
[   72.453925] ? rtnl_newlink (net/core/rtnetlink.c:3901)
[   72.454055] ? rtnl_newlink (net/core/rtnetlink.c:3901)
[   72.454185] ? rtnl_newlink (net/core/rtnetlink.c:3901)
[   72.454314] ? trace_contention_end (include/trace/events/lock.h:122)
[   72.454467] ? __mutex_lock (arch/x86/include/asm/preempt.h:85 kernel/locking/mutex.c:611 kernel/locking/mutex.c:746)
[   72.454597] ? cap_capable (include/trace/events/capability.h:26)
[   72.454721] ? security_capable (security/security.c:?)
[   72.454857] rtnl_newlink (net/core/rtnetlink.c:?)
[   72.454982] ? lock_is_held_type (kernel/locking/lockdep.c:5599 kernel/locking/lockdep.c:5938)
[   72.455121] ? __lock_acquire (kernel/locking/lockdep.c:?)
[   72.455256] ? __change_page_attr_set_clr (arch/x86/mm/pat/set_memory.c:685)
[   72.455438] ? __lock_acquire (kernel/locking/lockdep.c:?)
[   72.455582] ? rtnetlink_rcv_msg (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 net/core/rtnetlink.c:6885)
[   72.455721] ? lock_acquire (kernel/locking/lockdep.c:5866)
[   72.455848] ? rtnetlink_rcv_msg (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 net/core/rtnetlink.c:6885)
[   72.455987] ? lock_release (kernel/locking/lockdep.c:?)
[   72.456117] ? rcu_read_unlock (include/linux/rcupdate.h:341 include/linux/rcupdate.h:871)
[   72.456249] ? __pfx_rtnl_newlink (net/core/rtnetlink.c:3956)
[   72.456388] rtnetlink_rcv_msg (net/core/rtnetlink.c:6955)
[   72.456526] ? rtnetlink_rcv_msg (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 net/core/rtnetlink.c:6885)
[   72.456671] ? lock_acquire (kernel/locking/lockdep.c:5866)
[   72.456802] ? net_generic (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 include/net/netns/generic.h:45)
[   72.456929] ? __pfx_rtnetlink_rcv_msg (net/core/rtnetlink.c:6858)
[   72.457082] netlink_rcv_skb (net/netlink/af_netlink.c:2534)
[   72.457212] netlink_unicast (net/netlink/af_netlink.c:1313)
[   72.457344] netlink_sendmsg (net/netlink/af_netlink.c:1883)
[   72.457476] __sock_sendmsg (net/socket.c:712)
[   72.457602] ____sys_sendmsg (net/socket.c:?)
[   72.457735] ? _copy_from_user (arch/x86/include/asm/uaccess_64.h:126 arch/x86/include/asm/uaccess_64.h:134 arch/x86/include/asm/uaccess_64.h:141 include/linux/uaccess.h:178 lib/usercopy.c:18)
[   72.457875] ___sys_sendmsg (net/socket.c:2620)
[   72.458042] ? __call_rcu_common (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:119 arch/x86/include/asm/irqflags.h:159 kernel/rcu/tree.c:3107)
[   72.458185] ? mntput_no_expire (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 fs/namespace.c:1457)
[   72.458324] ? lock_acquire (kernel/locking/lockdep.c:5866)
[   72.458451] ? mntput_no_expire (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 fs/namespace.c:1457)
[   72.458588] ? lock_release (kernel/locking/lockdep.c:?)
[   72.458718] ? mntput_no_expire (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 fs/namespace.c:1457)
[   72.458856] __x64_sys_sendmsg (net/socket.c:2652)
[   72.458997] ? do_syscall_64 (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:119 include/linux/entry-common.h:198 arch/x86/entry/syscall_64.c:90)
[   72.459136] do_syscall_64 (arch/x86/entry/syscall_64.c:?)
[   72.459259] ? exc_page_fault (arch/x86/mm/fault.c:1542)
[   72.459387] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
[   72.459555] RIP: 0033:0x7fd15f17cbd0

Signed-off-by: Eric Dumazet <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
matttbe pushed a commit that referenced this issue May 23, 2025
Currently iwdev->rf is allocated in irdma_probe(), but freed in
irdma_ib_dealloc_device(), which can be misleading. Move the free to
irdma_remove() to make the pairing more obvious.

Freeing in irdma_ib_dealloc_device() leads to a KASAN use-after-free
report, and can also lead to a NULL pointer dereference. Fix this.

irdma_deinit_interrupts() can't be moved before freeing iwdev->rf,
because in that case the interrupts would be deinitialized before the
irqs are freed. The simplest solution is to move kfree(iwdev->rf) to
irdma_remove().

Reproducer:
  sudo rmmod irdma

Minified splat(s):
  BUG: KASAN: use-after-free in irdma_remove+0x257/0x2d0 [irdma]
  Call Trace:
   <TASK>
   ? __pfx__raw_spin_lock_irqsave+0x10/0x10
   ? kfree+0x253/0x450
   ? irdma_remove+0x257/0x2d0 [irdma]
   kasan_report+0xed/0x120
   ? irdma_remove+0x257/0x2d0 [irdma]
   irdma_remove+0x257/0x2d0 [irdma]
   auxiliary_bus_remove+0x56/0x80
   device_release_driver_internal+0x371/0x530
   ? kernfs_put.part.0+0x147/0x310
   driver_detach+0xbf/0x180
   bus_remove_driver+0x11b/0x2a0
   auxiliary_driver_unregister+0x1a/0x50
   irdma_exit_module+0x40/0x4c [irdma]

  Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN NOPTI
  KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
  RIP: 0010:ice_free_rdma_qvector+0x2a/0xa0 [ice]
  Call Trace:
   ? ice_free_rdma_qvector+0x2a/0xa0 [ice]
   irdma_remove+0x179/0x2d0 [irdma]
   auxiliary_bus_remove+0x56/0x80
   device_release_driver_internal+0x371/0x530
   ? kobject_put+0x61/0x4b0
   driver_detach+0xbf/0x180
   bus_remove_driver+0x11b/0x2a0
   auxiliary_driver_unregister+0x1a/0x50
   irdma_exit_module+0x40/0x4c [irdma]

Reported-by: Marcin Szycik <[email protected]>
Closes: https://lore.kernel.org/netdev/[email protected]/
Fixes: 3e0d3cb ("ice, irdma: move interrupts code to irdma")
Reviewed-by: Marcin Szycik <[email protected]>
Signed-off-by: Michal Swiatkowski <[email protected]>
Signed-off-by: Tatyana Nikolova <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Leon Romanovsky <[email protected]>
matttbe pushed a commit that referenced this issue May 23, 2025
When memory is insufficient, the allocation of nfs_lock_context in
nfs_get_lock_context() fails and returns -ENOMEM. If we mistakenly treat
an nfs4_unlockdata structure (whose l_ctx member has been set to -ENOMEM)
as valid and proceed to execute rpc_run_task(), this will trigger a NULL
pointer dereference in nfs4_locku_prepare. For example:

BUG: kernel NULL pointer dereference, address: 000000000000000c
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP PTI
CPU: 15 UID: 0 PID: 12 Comm: kworker/u64:0 Not tainted 6.15.0-rc2-dirty #60
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40
Workqueue: rpciod rpc_async_schedule
RIP: 0010:nfs4_locku_prepare+0x35/0xc2
Code: 89 f2 48 89 fd 48 c7 c7 68 69 ef b5 53 48 8b 8e 90 00 00 00 48 89 f3
RSP: 0018:ffffbbafc006bdb8 EFLAGS: 00010246
RAX: 000000000000004b RBX: ffff9b964fc1fa00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: fffffffffffffff4 RDI: ffff9ba53fddbf40
RBP: ffff9ba539934000 R08: 0000000000000000 R09: ffffbbafc006bc38
R10: ffffffffb6b689c8 R11: 0000000000000003 R12: ffff9ba539934030
R13: 0000000000000001 R14: 0000000004248060 R15: ffffffffb56d1c30
FS: 0000000000000000(0000) GS:ffff9ba5881f0000(0000) knlGS:00000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000000000c CR3: 000000093f244000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 __rpc_execute+0xbc/0x480
 rpc_async_schedule+0x2f/0x40
 process_one_work+0x232/0x5d0
 worker_thread+0x1da/0x3d0
 ? __pfx_worker_thread+0x10/0x10
 kthread+0x10d/0x240
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x34/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
Modules linked in:
CR2: 000000000000000c
---[ end trace 0000000000000000 ]---

Free the allocated nfs4_unlockdata when nfs_get_lock_context() fails and
return NULL to terminate subsequent rpc_run_task, preventing NULL pointer
dereference.
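
The error-pointer handling can be sketched in userspace. This is a model, not
the patch: ERR_PTR() and IS_ERR() are re-declared locally in the style of the
kernel helpers of the same name, and the *_model types and functions are
hypothetical.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Local userspace re-declarations modelled on the kernel helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long err)
{
	return (void *)(intptr_t)err;
}

static inline int IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical model types. */
struct lock_ctx_model {
	int dummy;
};

struct unlockdata_model {
	struct lock_ctx_model *l_ctx;
};

static int fail_ctx_alloc;		/* set to 1 to simulate -ENOMEM */
static struct lock_ctx_model the_ctx;

/* Models the lock context lookup: an error pointer on allocation failure. */
static struct lock_ctx_model *get_lock_context_model(void)
{
	return fail_ctx_alloc ? ERR_PTR(-ENOMEM) : &the_ctx;
}

/* Models the fixed allocation: free and return NULL if the context is an error. */
static struct unlockdata_model *alloc_unlockdata_model(void)
{
	struct unlockdata_model *p = calloc(1, sizeof(*p));

	if (!p)
		return NULL;
	p->l_ctx = get_lock_context_model();
	if (IS_ERR(p->l_ctx)) {
		free(p);
		return NULL;	/* caller must not start the RPC task */
	}
	return p;
}
```

Because the caller only sees NULL or a fully valid structure, no later code
path can dereference an l_ctx that is really an encoded error value.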

Fixes: f30cb75 ("NFS: Always wait for I/O completion before unlock")
Signed-off-by: Li Lingfeng <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Trond Myklebust <[email protected]>
matttbe pushed a commit that referenced this issue May 23, 2025
To support 36-bit DMA, a chip-proprietary bit must be configured via the
PCI config API or the chip DBI interface. However, the PCI device mmap
isn't set up yet and the DBI is also inaccessible via mmap, so the chip
can support 36-bit DMA only if the bit is accessible via the PCI config
API. Otherwise, fall back to 32-bit DMA.

With a NULL mmap address, the kernel throws this trace:

  BUG: unable to handle page fault for address: 0000000000001090
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  PGD 0 P4D 0
  Oops: Oops: 0002 [#1] PREEMPT SMP PTI
  CPU: 1 UID: 0 PID: 71 Comm: irq/26-pciehp Tainted: G           OE      6.14.2-061402-generic #202504101348
  Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
  RIP: 0010:rtw89_pci_ops_write16+0x12/0x30 [rtw89_pci]
  RSP: 0018:ffffb0ffc0acf9d8 EFLAGS: 00010206
  RAX: ffffffffc158f9c0 RBX: ffff94865e702020 RCX: 0000000000000000
  RDX: 0000000000000718 RSI: 0000000000001090 RDI: ffff94865e702020
  RBP: ffffb0ffc0acf9d8 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000015
  R13: 0000000000000719 R14: ffffb0ffc0acfa1f R15: ffffffffc1813060
  FS:  0000000000000000(0000) GS:ffff9486f3480000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000001090 CR3: 0000000090440001 CR4: 00000000000626f0
  Call Trace:
   <TASK>
   rtw89_pci_read_config_byte+0x6d/0x120 [rtw89_pci]
   rtw89_pci_cfg_dac+0x5b/0xb0 [rtw89_pci]
   rtw89_pci_probe+0xa96/0xbd0 [rtw89_pci]
   ? __pfx___device_attach_driver+0x10/0x10
   ? __pfx___device_attach_driver+0x10/0x10
   local_pci_probe+0x47/0xa0
   pci_call_probe+0x5d/0x190
   pci_device_probe+0xa7/0x160
   really_probe+0xf9/0x370
   ? pm_runtime_barrier+0x55/0xa0
   __driver_probe_device+0x8c/0x140
   driver_probe_device+0x24/0xd0
   __device_attach_driver+0xcd/0x170
   bus_for_each_drv+0x99/0x100
   __device_attach+0xb4/0x1d0
   device_attach+0x10/0x20
   pci_bus_add_device+0x59/0x90
   pci_bus_add_devices+0x31/0x80
   pciehp_configure_device+0xaa/0x170
   pciehp_enable_slot+0xd6/0x240
   pciehp_handle_presence_or_link_change+0xf1/0x180
   pciehp_ist+0x162/0x1c0
   irq_thread_fn+0x24/0x70
   irq_thread+0xef/0x1c0
   ? __pfx_irq_thread_fn+0x10/0x10
   ? __pfx_irq_thread_dtor+0x10/0x10
   ? __pfx_irq_thread+0x10/0x10
   kthread+0xfc/0x230
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x47/0x70
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>

Fixes: 1fd4b3f ("wifi: rtw89: pci: support 36-bit PCI DMA address")
Reported-by: Bitterblue Smith <[email protected]>
Closes: https://lore.kernel.org/linux-wireless/[email protected]/T/#u
Closes: openwrt/openwrt#17025
Signed-off-by: Ping-Ke Shih <[email protected]>
Link: https://patch.msgid.link/[email protected]
matttbe pushed a commit that referenced this issue May 23, 2025
When xdp is attached or detached, dev->ndo_bpf() is called by
do_setlink(), and it acquires netdev_lock() if needed.
Unlike other drivers, the bnxt driver is protected by netdev_lock while
xdp is attached/detached because it sets dev->request_ops_lock to true.

So bnxt_xdp(), which is the ->ndo_bpf callback, should not acquire
netdev_lock().
But xdp_features_{set | clear}_redirect_target() was changed to
acquire netdev_lock() internally, which causes a deadlock.
To fix this problem, the bnxt driver should use
xdp_features_{set | clear}_redirect_target_locked() instead.
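
The locked/unlocked split can be sketched with a userspace model. All names
below are hypothetical, and the non-recursive mutex is modelled by a flag that
counts a would-be deadlock instead of hanging.

```c
#include <stdbool.h>

/* Hypothetical model of a device with a non-recursive lock. */
struct dev_model {
	bool locked;
	int deadlocks;		/* times a held lock was requested again */
	bool redirect_target;
};

/* Returns false where a real mutex would block forever on itself. */
static bool dev_lock(struct dev_model *d)
{
	if (d->locked) {
		d->deadlocks++;
		return false;
	}
	d->locked = true;
	return true;
}

static void dev_unlock(struct dev_model *d)
{
	d->locked = false;
}

/* "_locked" variant: the caller must already hold the device lock. */
static void set_redirect_target_locked(struct dev_model *d)
{
	d->redirect_target = true;
}

/* Plain variant: takes the lock itself, then calls the locked variant. */
static void set_redirect_target(struct dev_model *d)
{
	if (!dev_lock(d))
		return;
	set_redirect_target_locked(d);
	dev_unlock(d);
}

/* Two versions of a callback that is invoked with the lock already held. */
static void xdp_callback_buggy(struct dev_model *d)
{
	set_redirect_target(d);
}

static void xdp_callback_fixed(struct dev_model *d)
{
	set_redirect_target_locked(d);
}

/* Models the caller that holds the device lock around the callback. */
static void do_setlink_model(struct dev_model *d, void (*cb)(struct dev_model *))
{
	dev_lock(d);
	cb(d);
	dev_unlock(d);
}
```

Running the buggy callback through do_setlink_model() records one would-be
deadlock and leaves the flag unset, while the fixed callback sets the flag
without requesting the lock a second time.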

Splat looks like:
============================================
WARNING: possible recursive locking detected
6.15.0-rc6+ #1 Not tainted
--------------------------------------------
bpftool/1745 is trying to acquire lock:
ffff888131b85038 (&dev->lock){+.+.}-{4:4}, at: xdp_features_set_redirect_target+0x1f/0x80

but task is already holding lock:
ffff888131b85038 (&dev->lock){+.+.}-{4:4}, at: do_setlink.constprop.0+0x24e/0x35d0

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&dev->lock);
  lock(&dev->lock);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

3 locks held by bpftool/1745:
 #0: ffffffffa56131c8 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_setlink+0x1fe/0x570
 #1: ffffffffaafa75a0 (&net->rtnl_mutex){+.+.}-{4:4}, at: rtnl_setlink+0x236/0x570
 #2: ffff888131b85038 (&dev->lock){+.+.}-{4:4}, at: do_setlink.constprop.0+0x24e/0x35d0

stack backtrace:
CPU: 1 UID: 0 PID: 1745 Comm: bpftool Not tainted 6.15.0-rc6+ #1 PREEMPT(undef)
Hardware name: ASUS System Product Name/PRIME Z690-P D4, BIOS 0603 11/01/2021
Call Trace:
 <TASK>
 dump_stack_lvl+0x7a/0xd0
 print_deadlock_bug+0x294/0x3d0
 __lock_acquire+0x153b/0x28f0
 lock_acquire+0x184/0x340
 ? xdp_features_set_redirect_target+0x1f/0x80
 __mutex_lock+0x1ac/0x18a0
 ? xdp_features_set_redirect_target+0x1f/0x80
 ? xdp_features_set_redirect_target+0x1f/0x80
 ? __pfx_bnxt_rx_page_skb+0x10/0x10 [bnxt_en
 ? __pfx___mutex_lock+0x10/0x10
 ? __pfx_netdev_update_features+0x10/0x10
 ? bnxt_set_rx_skb_mode+0x284/0x540 [bnxt_en
 ? __pfx_bnxt_set_rx_skb_mode+0x10/0x10 [bnxt_en
 ? xdp_features_set_redirect_target+0x1f/0x80
 xdp_features_set_redirect_target+0x1f/0x80
 bnxt_xdp+0x34e/0x730 [bnxt_en 11cbcce8fa11cff1dddd7ef358d6219e4ca9add3]
 dev_xdp_install+0x3f4/0x830
 ? __pfx_bnxt_xdp+0x10/0x10 [bnxt_en 11cbcce8fa11cff1dddd7ef358d6219e4ca9add3]
 ? __pfx_dev_xdp_install+0x10/0x10
 dev_xdp_attach+0x560/0xf70
 dev_change_xdp_fd+0x22d/0x280
 do_setlink.constprop.0+0x2989/0x35d0
 ? __pfx_do_setlink.constprop.0+0x10/0x10
 ? lock_acquire+0x184/0x340
 ? find_held_lock+0x32/0x90
 ? rtnl_setlink+0x236/0x570
 ? rcu_is_watching+0x11/0xb0
 ? trace_contention_end+0xdc/0x120
 ? __mutex_lock+0x946/0x18a0
 ? __pfx___mutex_lock+0x10/0x10
 ? __lock_acquire+0xa95/0x28f0
 ? rcu_is_watching+0x11/0xb0
 ? rcu_is_watching+0x11/0xb0
 ? cap_capable+0x172/0x350
 rtnl_setlink+0x2cd/0x570

Fixes: 03df156 ("xdp: double protect netdev->xdp_flags with netdev->lock")
Signed-off-by: Taehee Yoo <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Reviewed-by: Michael Chan <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
matttbe pushed a commit that referenced this issue May 29, 2025
Punching a hole with a start offset that exceeds max_end is not
permitted and will result in a negative length in the
truncate_inode_partial_folio() function while truncating the page cache,
potentially leading to undesirable consequences.

A simple reproducer:

  truncate -s 9895604649994 /mnt/foo
  xfs_io -c "pwrite 8796093022208 4096" /mnt/foo
  xfs_io -c "fpunch 8796093022213 25769803777" /mnt/foo

  kernel BUG at include/linux/highmem.h:275!
  Oops: invalid opcode: 0000 [#1] SMP PTI
  CPU: 3 UID: 0 PID: 710 Comm: xfs_io Not tainted 6.15.0-rc3
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
  RIP: 0010:zero_user_segments.constprop.0+0xd7/0x110
  RSP: 0018:ffffc90001cf3b38 EFLAGS: 00010287
  RAX: 0000000000000005 RBX: ffffea0001485e40 RCX: 0000000000001000
  RDX: 000000000040b000 RSI: 0000000000000005 RDI: 000000000040b000
  RBP: 000000000040affb R08: ffff888000000000 R09: ffffea0000000000
  R10: 0000000000000003 R11: 00000000fffc7fc5 R12: 0000000000000005
  R13: 000000000040affb R14: ffffea0001485e40 R15: ffff888031cd3000
  FS:  00007f4f63d0b780(0000) GS:ffff8880d337d000(0000)
  knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000000001ae0b038 CR3: 00000000536aa000 CR4: 00000000000006f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <TASK>
   truncate_inode_partial_folio+0x3dd/0x620
   truncate_inode_pages_range+0x226/0x720
   ? bdev_getblk+0x52/0x3e0
   ? ext4_get_group_desc+0x78/0x150
   ? crc32c_arch+0xfd/0x180
   ? __ext4_get_inode_loc+0x18c/0x840
   ? ext4_inode_csum+0x117/0x160
   ? jbd2_journal_dirty_metadata+0x61/0x390
   ? __ext4_handle_dirty_metadata+0xa0/0x2b0
   ? kmem_cache_free+0x90/0x5a0
   ? jbd2_journal_stop+0x1d5/0x550
   ? __ext4_journal_stop+0x49/0x100
   truncate_pagecache_range+0x50/0x80
   ext4_truncate_page_cache_block_range+0x57/0x3a0
   ext4_punch_hole+0x1fe/0x670
   ext4_fallocate+0x792/0x17d0
   ? __count_memcg_events+0x175/0x2a0
   vfs_fallocate+0x121/0x560
   ksys_fallocate+0x51/0xc0
   __x64_sys_fallocate+0x24/0x40
   x64_sys_call+0x18d2/0x4170
   do_syscall_64+0xa7/0x220
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fix this by filtering out cases where the punching start offset exceeds
max_end.
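
A minimal sketch of why the filter matters, assuming the end of the range is clamped to max_end before the length is computed. The function names, the i_size check and the choice of `>=` are assumptions made for illustration; this is not the ext4 code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model without the filter: the end is clamped to max_end,
 * so a start offset beyond max_end yields a negative length.
 */
static int64_t toy_punch_len_unfiltered(int64_t offset, int64_t length,
                                        int64_t max_end)
{
    int64_t end = offset + length;

    if (end > max_end)
        end = max_end;
    return end - offset;
}

/*
 * Hypothetical model with the filter: a start offset at or beyond i_size
 * or max_end punches nothing, so the length can never go negative.
 */
static int64_t toy_punch_len(int64_t offset, int64_t length,
                             int64_t i_size, int64_t max_end)
{
    int64_t end = offset + length;

    assert(offset >= 0 && length >= 0);

    if (offset >= i_size || offset >= max_end)
        return 0;

    if (end > max_end)
        end = max_end;
    return end - offset;
}
```

With illustrative values max_end = 4096, offset = 5000 and length = 100, the unfiltered model returns -904, while the filtered one returns 0.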

Fixes: 982bf37 ("ext4: refactor ext4_punch_hole()")
Reported-by: Liebes Wang <[email protected]>
Closes: https://lore.kernel.org/linux-ext4/[email protected]/
Tested-by: Liebes Wang <[email protected]>
Signed-off-by: Zhang Yi <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Baokun Li <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
Cc: [email protected]
matttbe pushed a commit that referenced this issue May 29, 2025
__run_io_and_remove() is used in several stress tests to run heavy
IO while the device is being removed.

However, the fio script uses sequential `readwrite`, which isn't
correct: random IO should be used to saturate the ublk device.

It also turns out that '--num_jobs=4' isn't stressful enough, so change
it to '--num_jobs=$(nproc)'.

Finally, `test_stress_02.sh` doesn't cover the single queue case, so add
a single queue test, which triggers request tag recycling more easily.

With the above changes, the issue in #1 can be reproduced reliably in stress_02.sh.

Link: https://lore.kernel.org/linux-block/mruqwpf4tqenkbtgezv5oxwq7ngyq24jzeyqy4ixzvivatbbxv@4oh2wzz4e6qn/ #1

Cc: Jared Holzman <[email protected]>
Cc: Shinichiro Kawasaki <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
matttbe pushed a commit that referenced this issue May 29, 2025
When performing a right split on a folio, the split_at2 may point to a
not-present page if the offset + length equals the original folio size,
which will trigger the following error:

 BUG: unable to handle page fault for address: ffffea0006000008
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 143ffb9067 P4D 143ffb9067 PUD 143ffb8067 PMD 0
 Oops: Oops: 0000 [#1] SMP PTI
 CPU: 0 UID: 0 PID: 502640 Comm: fsx Not tainted 6.15.0-rc3-gc6156189fc6b #889 PR
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/4
 RIP: 0010:truncate_inode_partial_folio+0x208/0x620
 Code: ff 03 48 01 da e8 78 7e 13 00 48 83 05 10 b5 5a 0c 01 85 c0 0f 85 1c 02 001
 RSP: 0018:ffffc90005bafab0 EFLAGS: 00010286
 RAX: 0000000000000000 RBX: ffffea0005ffff00 RCX: 0000000000000002
 RDX: 000000000000000c RSI: 0000000000013975 RDI: ffffc90005bafa30
 RBP: ffffea0006000000 R08: 0000000000000000 R09: 00000000000009bf
 R10: 00000000000007e0 R11: 0000000000000000 R12: 0000000000001633
 R13: 0000000000000000 R14: ffffea0005ffff00 R15: fffffffffffffffe
 FS:  00007f9f9a161740(0000) GS:ffff8894971fd000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: ffffea0006000008 CR3: 000000017c2ae000 CR4: 00000000000006f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  <TASK>
  truncate_inode_pages_range+0x226/0x720
  truncate_pagecache+0x57/0x90
  ...

Fix this issue by skipping the split if the truncation aligns with the
folio size, which ensures the split page number lies within the folio.
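
The boundary condition can be sketched as follows. This is a simplified userspace model: `TOY_PAGE_SIZE` and the `toy_*` helpers are hypothetical, and the real code works on struct folio and struct page rather than plain indices.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u /* illustrative page size */

/* Index of the page that contains byte position pos within the folio. */
static size_t toy_split_index(size_t pos)
{
    return pos / TOY_PAGE_SIZE;
}

/*
 * Decide whether a right split is needed for the truncated range
 * [offset, offset + length) inside a folio of folio_size bytes.
 * The caller guarantees offset + length <= folio_size.
 * On true, *index holds the page index to split at.
 */
static bool toy_right_split_needed(size_t offset, size_t length,
                                   size_t folio_size, size_t *index)
{
    size_t end = offset + length;
    size_t nr_pages = folio_size / TOY_PAGE_SIZE;

    /*
     * The range runs to the end of the folio: the index would equal
     * nr_pages, one past the last page, so skip the split.
     */
    if (end == folio_size)
        return false;

    *index = toy_split_index(end);
    assert(*index < nr_pages);
    return true;
}
```

For a 16-page folio (65536 bytes), a range ending at byte 65536 would map to index 16, which is outside the folio, so the sketch skips the split; a range ending at byte 12288 splits at index 3.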

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 7460b47 ("mm/truncate: use folio_split() in truncate operation")
Signed-off-by: Zhang Yi <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Cc: ErKun Yang <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
matttbe pushed a commit that referenced this issue May 29, 2025
…ugetlb folios

A kernel crash was observed when replacing free hugetlb folios:

BUG: kernel NULL pointer dereference, address: 0000000000000028
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 28 UID: 0 PID: 29639 Comm: test_cma.sh Tainted 6.15.0-rc6-zp #41 PREEMPT(voluntary)
RIP: 0010:alloc_and_dissolve_hugetlb_folio+0x1d/0x1f0
RSP: 0018:ffffc9000b30fa90 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000342cca RCX: ffffea0043000000
RDX: ffffc9000b30fb08 RSI: ffffea0043000000 RDI: 0000000000000000
RBP: ffffc9000b30fb20 R08: 0000000000001000 R09: 0000000000000000
R10: ffff88886f92eb00 R11: 0000000000000000 R12: ffffea0043000000
R13: 0000000000000000 R14: 00000000010c0200 R15: 0000000000000004
FS:  00007fcda5f14740(0000) GS:ffff8888ec1d8000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000028 CR3: 0000000391402000 CR4: 0000000000350ef0
Call Trace:
<TASK>
 replace_free_hugepage_folios+0xb6/0x100
 alloc_contig_range_noprof+0x18a/0x590
 ? srso_return_thunk+0x5/0x5f
 ? down_read+0x12/0xa0
 ? srso_return_thunk+0x5/0x5f
 cma_range_alloc.constprop.0+0x131/0x290
 __cma_alloc+0xcf/0x2c0
 cma_alloc_write+0x43/0xb0
 simple_attr_write_xsigned.constprop.0.isra.0+0xb2/0x110
 debugfs_attr_write+0x46/0x70
 full_proxy_write+0x62/0xa0
 vfs_write+0xf8/0x420
 ? srso_return_thunk+0x5/0x5f
 ? filp_flush+0x86/0xa0
 ? srso_return_thunk+0x5/0x5f
 ? filp_close+0x1f/0x30
 ? srso_return_thunk+0x5/0x5f
 ? do_dup2+0xaf/0x160
 ? srso_return_thunk+0x5/0x5f
 ksys_write+0x65/0xe0
 do_syscall_64+0x64/0x170
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

There is a potential race between __update_and_free_hugetlb_folio() and
replace_free_hugepage_folios():

CPU1                              CPU2
__update_and_free_hugetlb_folio   replace_free_hugepage_folios
                                    folio_test_hugetlb(folio)
                                    -- It's still hugetlb folio.

  __folio_clear_hugetlb(folio)
  hugetlb_free_folio(folio)
                                    h = folio_hstate(folio)
                                    -- Here, h is NULL pointer

When the above race condition occurs, folio_hstate(folio) returns NULL,
and the subsequent access through this NULL pointer crashes the system.
To resolve this issue, call folio_hstate(folio) while holding
hugetlb_lock, ensuring that folio_hstate(folio) does not return NULL.
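
The check-then-use pattern can be sketched in userspace C with a pthread mutex standing in for hugetlb_lock. All `toy_*` names are hypothetical, and the sketch assumes the free path clears the hugetlb state under the same lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_hstate {
    int order;
};

struct toy_folio {
    bool is_hugetlb;           /* cleared when the folio is freed */
    struct toy_hstate *hstate; /* only valid while is_hugetlb is set */
};

/* Stand-in for hugetlb_lock. */
static pthread_mutex_t toy_hugetlb_lock = PTHREAD_MUTEX_INITIALIZER;

/* Free path: clears the flag and the hstate in one critical section. */
static void toy_free_hugetlb_folio(struct toy_folio *folio)
{
    assert(folio != NULL);
    pthread_mutex_lock(&toy_hugetlb_lock);
    folio->is_hugetlb = false;
    folio->hstate = NULL;
    pthread_mutex_unlock(&toy_hugetlb_lock);
}

/*
 * Reader: tests the flag and fetches the hstate under the same lock, so
 * the free path cannot run between the check and the use. Returns NULL
 * only when the folio is no longer a hugetlb folio.
 */
static struct toy_hstate *toy_get_hstate(struct toy_folio *folio)
{
    struct toy_hstate *h = NULL;

    assert(folio != NULL);
    pthread_mutex_lock(&toy_hugetlb_lock);
    if (folio->is_hugetlb)
        h = folio->hstate;
    pthread_mutex_unlock(&toy_hugetlb_lock);
    return h;
}
```

Because the flag test and the hstate read share one critical section, a caller either sees a valid hstate or learns that the folio has already been freed; it never sees the flag set with a NULL hstate.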

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 04f13d2 ("mm: replace free hugepage folios after migration")
Signed-off-by: Ge Yang <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Barry Song <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
matttbe pushed a commit that referenced this issue May 29, 2025
syzkaller reported a null-ptr-deref in txopt_get(). [0]

The offset 0x70 was of struct ipv6_txoptions in struct ipv6_pinfo,
so struct ipv6_pinfo was NULL there.

However, this never happens for IPv6 sockets as inet_sk(sk)->pinet6
is always set in inet6_create(), meaning the socket was not an IPv6 one.

The root cause is missing validation in netlbl_conn_setattr().

netlbl_conn_setattr() switches branches based on struct
sockaddr.sa_family, which is passed from userspace.  However,
netlbl_conn_setattr() does not check if the address family matches
the socket.

syzkaller must have called connect() for an IPv6 address on
an IPv4 socket.

We have proper validation in tcp_v[46]_connect(), but
security_socket_connect() is called at an earlier stage.

Let's copy the validation to netlbl_conn_setattr().
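
The missing check can be sketched in userspace C. `struct toy_sock` and `toy_check_family()` are hypothetical, and returning -EAFNOSUPPORT is an assumption made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical stand-in for the kernel socket; only the family matters. */
struct toy_sock {
    int family;
};

/*
 * Reject an IPv6 address on a non-IPv6 socket before any code touches
 * IPv6-only socket state. Returns 0 when it is safe to continue and a
 * negative errno otherwise.
 */
static int toy_check_family(const struct toy_sock *sk,
                            const struct sockaddr *addr)
{
    assert(sk != NULL && addr != NULL);

    if (addr->sa_family == AF_INET6 && sk->family != AF_INET6)
        return -EAFNOSUPPORT;
    return 0;
}
```

In this model an AF_INET socket given an AF_INET6 address is rejected up front, which corresponds to the connect() call that syzkaller is assumed to have made.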

[0]:
Oops: general protection fault, probably for non-canonical address 0xdffffc000000000e: 0000 [#1] PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000070-0x0000000000000077]
CPU: 2 UID: 0 PID: 12928 Comm: syz.9.1677 Not tainted 6.12.0 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:txopt_get include/net/ipv6.h:390 [inline]
RIP: 0010:
Code: 02 00 00 49 8b ac 24 f8 02 00 00 e8 84 69 2a fd e8 ff 00 16 fd 48 8d 7d 70 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 53 02 00 00 48 8b 6d 70 48 85 ed 0f 84 ab 01 00
RSP: 0018:ffff88811b8afc48 EFLAGS: 00010212
RAX: dffffc0000000000 RBX: 1ffff11023715f8a RCX: ffffffff841ab00c
RDX: 000000000000000e RSI: ffffc90007d9e000 RDI: 0000000000000070
RBP: 0000000000000000 R08: ffffed1023715f9d R09: ffffed1023715f9e
R10: ffffed1023715f9d R11: 0000000000000003 R12: ffff888123075f00
R13: ffff88810245bd80 R14: ffff888113646780 R15: ffff888100578a80
FS:  00007f9019bd7640(0000) GS:ffff8882d2d00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f901b927bac CR3: 0000000104788003 CR4: 0000000000770ef0
PKRU: 80000000
Call Trace:
 <TASK>
 calipso_sock_setattr+0x56/0x80 net/netlabel/netlabel_calipso.c:557
 netlbl_conn_setattr+0x10c/0x280 net/netlabel/netlabel_kapi.c:1177
 selinux_netlbl_socket_connect_helper+0xd3/0x1b0 security/selinux/netlabel.c:569
 selinux_netlbl_socket_connect_locked security/selinux/netlabel.c:597 [inline]
 selinux_netlbl_socket_connect+0xb6/0x100 security/selinux/netlabel.c:615
 selinux_socket_connect+0x5f/0x80 security/selinux/hooks.c:4931
 security_socket_connect+0x50/0xa0 security/security.c:4598
 __sys_connect_file+0xa4/0x190 net/socket.c:2067
 __sys_connect+0x12c/0x170 net/socket.c:2088
 __do_sys_connect net/socket.c:2098 [inline]
 __se_sys_connect net/socket.c:2095 [inline]
 __x64_sys_connect+0x73/0xb0 net/socket.c:2095
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xaa/0x1b0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f901b61a12d
Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f9019bd6fa8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
RAX: ffffffffffffffda RBX: 00007f901b925fa0 RCX: 00007f901b61a12d
RDX: 000000000000001c RSI: 0000200000000140 RDI: 0000000000000003
RBP: 00007f901b701505 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 00007f901b5b62a0 R15: 00007f9019bb7000
 </TASK>
Modules linked in:

Fixes: ceba183 ("calipso: Set the calipso socket label to match the secattr.")
Reported-by: syzkaller <[email protected]>
Reported-by: John Cheung <[email protected]>
Closes: https://lore.kernel.org/netdev/CAP=Rh=M1LzunrcQB1fSGauMrJrhL6GGps5cPAKzHJXj6GQV+-g@mail.gmail.com/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Acked-by: Paul Moore <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>