[LTS 8.6] perf: Disallow mis-matched inherited group reads #475

Merged
merged 2 commits into from
Aug 13, 2025

pvts-mat
Contributor

[LTS 8.6]
CVE-2023-5717
VULN-4127

Problem

https://www.cve.org/CVERecord?id=CVE-2023-5717

A heap out-of-bounds write vulnerability in the Linux kernel's Linux Kernel Performance Events (perf) component can be exploited to achieve local privilege escalation. If perf_read_group() is called while an event's sibling_list is smaller than its child's sibling_list, it can increment or write to memory locations outside of the allocated buffer.
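The failure mode can be illustrated with a toy model. This is only a sketch in Python, not kernel code: the `Event` class, `read_group()` and `read_group_add()` are hypothetical stand-ins, the real read buffer also carries header fields that are ignored here, and an `IndexError` plays the role of the heap out-of-bounds write. The consistency check is loosely modelled on the `group_generation` comparison and `ECHILD` return described in the fix's commit message.

```python
import errno
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """Toy stand-in for a group leader or sibling; not the kernel's perf_event."""
    count: int
    siblings: List["Event"] = field(default_factory=list)
    children: List["Event"] = field(default_factory=list)  # inherited copies
    group_generation: int = 0


def read_group_add(leader, values, parent=None, checked=True):
    # With the check on, a child group whose composition differs from the
    # parent's is refused instead of being summed.
    if checked and parent is not None:
        if (parent.group_generation != leader.group_generation
                or len(parent.siblings) != len(leader.siblings)):
            return -errno.ECHILD
    for n, ev in enumerate([leader] + leader.siblings):
        values[n] += ev.count  # IndexError here stands in for the OOB write
    return 0


def read_group(parent, checked=True):
    values = [0] * (1 + len(parent.siblings))  # buffer sized from the parent only
    read_group_add(parent, values)
    for child in parent.children:
        ret = read_group_add(child, values, parent, checked)
        if ret:
            return ret, None
    return 0, values


if __name__ == "__main__":
    parent = Event(1, siblings=[Event(2)])
    parent.children.append(Event(10, siblings=[Event(20), Event(30)]))
    print(read_group(parent, checked=True))
```

Calling `read_group(parent, checked=False)` on the same mismatched group raises `IndexError`, which corresponds to the unpatched behaviour in this model.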

Applicability: yes

The perf component is built when the CONFIG_PERF_EVENTS option is set, which is the case in most ciqlts8_6 configs:

$ grep 'CONFIG_PERF_EVENTS\b' configs/*.config

configs/kernel-aarch64-debug.config:CONFIG_PERF_EVENTS=y
configs/kernel-aarch64.config:CONFIG_PERF_EVENTS=y
configs/kernel-ppc64le-debug.config:CONFIG_PERF_EVENTS=y
configs/kernel-ppc64le.config:CONFIG_PERF_EVENTS=y
configs/kernel-s390x-debug.config:CONFIG_PERF_EVENTS=y
configs/kernel-s390x-zfcpdump.config:# CONFIG_PERF_EVENTS is not set
configs/kernel-s390x.config:CONFIG_PERF_EVENTS=y
configs/kernel-x86_64-debug.config:CONFIG_PERF_EVENTS=y
configs/kernel-x86_64.config:CONFIG_PERF_EVENTS=y

Commit fa8c269 ("perf/core: Invert perf_read_group() loops"), which introduced the bug by flipping the iteration order of child_list and sibling_list, is present in ciqlts8_6's history. The fixing commit 32671e3 is missing and has not been backported.

Solution

The mainline fix 32671e3 adds a new group_generation field to the perf_event struct. This breaks LTS 8.6 kABI.

Investigation

The check-kabi program reports 57 symbols changed: __lock_page, __module_get, __page_file_index, __pagevec_release, __put_page, __put_task_struct, __scsi_iterate_devices, __task_pid_nr_ns, alloc_pages_current, blk_alloc_queue, blk_cleanup_queue, blk_get_request, blk_put_request, blk_queue_flag_clear, blk_queue_flag_set, filemap_fault, find_pid_ns, find_vma, flush_signals, force_sig, generic_make_request, get_task_mm, kernel_recvmsg, kernel_sendmsg, kernel_setsockopt, kmem_cache_create, kmem_cache_create_usercopy, kmem_cache_destroy, kmem_cache_shrink, mark_page_accessed, mmput, module_put, module_refcount, pagevec_lookup_range, pagevec_lookup_range_tag, pid_task, read_cache_pages, sched_setscheduler, scsi_change_queue_depth, scsi_device_get, scsi_device_lookup, scsi_device_put, scsi_get_vpd_page, send_sig, set_cpus_allowed_ptr, set_page_dirty, set_page_dirty_lock, set_user_nice, sock_create_kern, sock_release, starget_for_each_device, submit_bio, try_module_get, unlock_page, wait_on_page_bit, wake_up_process, write_cache_pages.

kabi-dw (https://github.com/skozina/kabi-dw), a tool developed by a Red Hat engineer specifically for debugging kABI breakage, was used to establish how the changed whitelisted symbols relate to the modified perf_event structure, in order to assess whether a workaround could be devised.

The tool requires the kernel and the modules to be compiled with the CONFIG_DEBUG_INFO option set (it is set by default in all ciqlts8_6 configs). After the binaries are built, their .debug_info section is read and each symbol is dumped into a separate text file, for example func--kernel_setsockopt.txt:

Version: 1.0
File: /mnt/code/kernel-src-tree-ciqlts8_6-CVE-2023-5717-kabibreak/net/socket.c:3735
Symbol:
func kernel_setsockopt (
sock * @"struct--socket.txt"
level "int"
optname "int"
optval * "char"
optlen "unsigned int"
)
"int"

Full output: kabi-ciqlts8_6-CVE-2023-5717.tar.gz

As is visible in the example above in the

sock * @"struct--socket.txt"

line, the constituent symbols of compound types are referenced in the form @"‹symbol-file›". These references define a directed graph with symbol files as vertices. Searching for paths from each changed whitelisted symbol's file (e.g. func--scsi_device_get.txt) to the patch-modified symbol file struct--perf_event.txt shows how the CVE-2023-5717 patch breaks the kABI. This was done with Python's implicit graph library nographs (https://nographs.readthedocs.io/en/latest/). While the full picture would include all paths, only the shortest ones were explored for simplicity:

perf_event-paths.txt

The file contains sections like

alloc_pages_current:
func--alloc_pages_current.txt
struct--page.txt
struct--mem_cgroup.txt
struct--task_struct.txt
struct--thread_struct.txt
struct--perf_event.txt

This shows that alloc_pages_current is a function whose definition uses the page struct (here as the return type, although that cannot be told from this output; see below), which defines a field using the mem_cgroup struct, which defines a field using the task_struct struct, which defines a field using the thread_struct struct, which in turn defines a field using the perf_event struct whose definition the patch changes.
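The path search described above can be sketched as follows. This is not the script used for this PR: it swaps nographs for a plain breadth-first search from the standard library, and the `SYMBOL_FILES` contents are made-up abbreviations that keep only the `@"..."` references (real kabi-dw files would be read from disk). The names `references()` and `shortest_path()` are hypothetical.

```python
import re
from collections import deque

# Matches kabi-dw style references such as @"struct--socket.txt".
REF = re.compile(r'@"([^"]+)"')

# Assumed, heavily abbreviated stand-ins for the dumped symbol files.
SYMBOL_FILES = {
    "func--alloc_pages_current.txt": 'gfp @"typedef--gfp_t.txt" * @"struct--page.txt"',
    "struct--page.txt": '* @"struct--address_space.txt" * @"struct--mem_cgroup.txt"',
    "struct--address_space.txt": '* @"struct--inode.txt"',
    "struct--mem_cgroup.txt": '* @"struct--task_struct.txt"',
    "struct--task_struct.txt": '@"struct--thread_struct.txt"',
    "struct--thread_struct.txt": '* @"struct--perf_event.txt"',
    "struct--perf_event.txt": "",
}


def references(text):
    """Return the symbol files referenced by one symbol file's text."""
    return REF.findall(text)


def shortest_path(files, start, goal):
    """Breadth-first search over the reference graph; None if unreachable."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in references(files.get(node, "")):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None


if __name__ == "__main__":
    for name in shortest_path(SYMBOL_FILES, "func--alloc_pages_current.txt",
                              "struct--perf_event.txt"):
        print(name)
```

Breadth-first search returns one shortest path per start symbol, which matches the "only the shortest ones" simplification; enumerating all paths would need a different traversal.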

While the format above gives a good overview, properly assessing the impact of the perf_event change requires knowing how the symbols are used (e.g. as a pointer vs. embedded by value). This is easily shown by running grep -C on each symbol file, searching for the next symbol in the usage chain:

perf_event-paths-context.txt

The entries take the following form (note that the chain order is reversed):

================================================================
alloc_pages_current:
[struct] perf_event

[struct] thread_struct
0x26    short unsigned int gsindex;
0x28    long unsigned int fsbase;
0x30    long unsigned int gsbase;
0x38    struct perf_event *ptrace_bps[4];
0x58    long unsigned int debugreg6;
0x60    long unsigned int ptrace_dr7;
0x68    long unsigned int cr2;

[struct] task_struct
                union {
                };
        };
0x1380  struct thread_struct thread;
};

[struct] mem_cgroup
0xee0   struct list_head objcg_list;
0xf00   struct memcg_padding _pad2_;
0xf00   atomic_t moving_account;
0xf08   struct task_struct *move_lock_task;
0xf10   struct memcg_vmstats_percpu *vmstats_percpu;
0xf18   struct list_head cgwb_list;
0xf28   struct wb_domain cgwb_domain;

[struct] page
0x38    union {
                long unsigned int memcg_data;
                struct {
0x0                     struct mem_cgroup *mem_cgroup;
                }rh_kabi_hidden_207;
                union {
                };

[func] alloc_pages_current
struct page *alloc_pages_current(
        gfp_t gfp,
        unsigned int order
);

This shows that:

  1. The ptrace_bps array in thread_struct holds pointers to perf_event rather than perf_event elements themselves, so the memory layout change does not propagate to thread_struct or any further.
  2. The modified perf_event struct sits behind a chain of 3 pointer dereferences from the object returned by the whitelisted symbol alloc_pages_current (page → mem_cgroup → task_struct → perf_event).
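The context extraction behind these entries can be approximated in a few lines. The following is a minimal sketch of `grep -C` semantics, assuming plain substring matching; `grep_context()` is a hypothetical helper, and the sample text is the thread_struct excerpt shown above.

```python
def grep_context(text, needle, context=3):
    """Return lines containing `needle` plus `context` lines on each side."""
    lines = text.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if needle in line:
            lo = max(0, i - context)
            hi = min(len(lines), i + context + 1)
            keep.update(range(lo, hi))
    return [lines[i] for i in sorted(keep)]


THREAD_STRUCT = """\
0x26    short unsigned int gsindex;
0x28    long unsigned int fsbase;
0x30    long unsigned int gsbase;
0x38    struct perf_event *ptrace_bps[4];
0x58    long unsigned int debugreg6;
0x60    long unsigned int ptrace_dr7;
0x68    long unsigned int cr2;
"""

if __name__ == "__main__":
    for line in grep_context(THREAD_STRUCT, "struct perf_event", context=1):
        print(line)
```

Running the helper once per adjacent pair in a path (each file searched for the next symbol's name) would yield one block per hop, as in the entries above.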

Analysis

Analyzing the files obtained in the previous steps results in the following observations:

  1. All modified whitelisted symbols are functions.

  2. In 42 out of 57 cases the perf_event struct is used as a pointer in the thread_struct (see example above).

    0x38    struct perf_event *ptrace_bps[4];
    
  3. In the remaining 15 out of 57 cases the perf_event struct is used in the trace_event_call struct, as a pointer argument of a function pointer field:

    0x88    int (*perf_perm)(
                    struct trace_event_call *,
                    struct perf_event *
            );
    };
    
  4. The number of pointer dereferences needed to reach the perf_event struct is at least 2.

Points 2 and 3 stem from perf_event struct having the property of being strictly allocated from kernel's heap through the perf_event_alloc(…) function, which means that it's always dynamically allocated and referenced indirectly through a pointer.

Following "A Kernel Developer's Guide to kABI" (https://gitlab.com/redhat/centos-stream/src/kernel/documentation/-/tree/main/content/docs/kABI?ref_type=heads):

Fortunately, ’struct pci_dev’ meets very strict requirements that allows new members to be logically added to it: it is strictly allocated from the kernel’s heap via a common function - ’pci_alloc_dev()’. If a data structure is strictly allocated from the kernel’s heap (i.e., no static definitions of the structure are allowed, nor are automatic allocations on the stack) then there is a technique that can be used for adding new members to it without breaking kABI

While the kABI formally must be broken, it can effectively be preserved given the conditions above. This effective kABI preservation can be signaled to the kabi-checking tools through the RH_KABI_EXTEND macro (include/linux/rh_kabi.h):

* RH_KABI_EXTEND
* Adds a new field to a struct. This must always be added to the end of
* the struct. Before using this macro, make sure this is actually safe
* to do - there is a number of conditions under which it is not safe.
* In particular (but not limited to), this macro cannot be used:
* - if the struct in question is embedded in another struct, or
* - if the struct is allocated by drivers either statically or
* dynamically, or
* - if the struct is allocated together with driver data (an example of
* such behavior is struct net_device or struct request).

None of the three disqualifying conditions listed applies to perf_event. Although the list is stated to be non-exhaustive, the full picture of perf_event's impact on whitelisted symbols given above strengthens the argument that the structure can be safely extended.
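The layout argument can be checked with a small model. This is a sketch using Python's ctypes, not the kernel's definitions: apart from group_generation, ptrace_bps and cr2, the field names and types are assumptions, and the two-field "old" struct is a stand-in for the real, much larger perf_event.

```python
import ctypes

# Assumed stand-in fields; the real perf_event has many more members.
OLD_FIELDS = [("state", ctypes.c_int), ("count", ctypes.c_uint64)]


class PerfEventOld(ctypes.Structure):
    _fields_ = OLD_FIELDS


class PerfEventNew(ctypes.Structure):
    # New member appended at the end, as RH_KABI_EXTEND requires.
    _fields_ = OLD_FIELDS + [("group_generation", ctypes.c_uint)]


def holder(event_type, by_pointer):
    """Build a struct that refers to four events, by pointer or by value."""
    member = ctypes.POINTER(event_type) if by_pointer else event_type

    class Holder(ctypes.Structure):
        _fields_ = [("ptrace_bps", member * 4), ("cr2", ctypes.c_ulong)]

    return Holder


if __name__ == "__main__":
    for by_pointer in (True, False):
        old = holder(PerfEventOld, by_pointer)
        new = holder(PerfEventNew, by_pointer)
        kind = "pointer" if by_pointer else "embedded"
        print(kind, "cr2 offset unchanged:", old.cr2.offset == new.cr2.offset)
```

In this model the existing members keep their offsets, and a holder that stores pointers keeps its own layout, while a holder that embeds the struct by value does not. That is the distinction the first RH_KABI_EXTEND condition guards against.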

Regarding the second condition: it can't be known for sure that no driver will, for example, place a perf_event struct on the stack, but such a situation would be out of scope. Following the guide again:

Intuitive, or experienced, developers will eventually ask: “What prevents an external driver module from allocating a ’struct pci_dev’ within itself, either statically (within the source file’s scope) or automatically (as an automatic variable within a routine; thus resides on the routine’s stack)?” Nothing, however if one researches, via ’cscope(1)’ or some similar tool, all ’struct pci_dev’ allocations within the kernel and it’s built-in modules, they will see that all allocations of ’pci_dev’ utilize ’pci_alloc_dev()’. This is where kABI enters a “grey area”: nothing prevents an external driver module from doing such, but, it would be considered ill formed; improper programming by a module is not considered a kABI breakage.

Solution

The group_generation field added in the mainline fix 32671e3 was preserved, but moved to the end of the struct and wrapped in the RH_KABI_EXTEND macro.

Additionally, a fix for the fix was committed to mainline as a71ef31; it is also included in this backport.

kABI check: passed

++ uname -m
+ python3 /data/src/ctrliq-github/kernel-dist-git-el-8.6/SOURCES/check-kabi -k /data/src/ctrliq-github/kernel-dist-git-el-8.6/SOURCES/Module.kabi_x86_64 -s vms/x86_64--build--ciqlts8_6/build_files/kernel-src-tree-ciqlts8_6-CVE-2023-5717/Module.symvers
kABI check passed
+ touch state/kernels/ciqlts8_6-CVE-2023-5717/x86_64/kabi_checked

Boot test: passed

boot-test.log

Kselftests: passed (relative to the reference kernel)

Coverage

android, bpf (except test_sockmap, test_maps, test_progs-no_alu32, test_progs, test_xsk.sh, test_kmod.sh), breakpoints (except step_after_suspend_test), capabilities, core, cpu-hotplug, cpufreq, exec, firmware, fpu, ftrace, futex, gpio, intel_pstate, ipc, kcmp, kexec, kvm, lib, livepatch, membarrier, memfd, memory-hotplug, mount, net/forwarding (except sch_tbf_prio.sh, ipip_hier_gre_keys.sh, mirror_gre_vlan_bridge_1q.sh, sch_ets.sh, tc_actions.sh, mirror_gre_bridge_1d_vlan.sh, sch_tbf_root.sh, sch_tbf_ets.sh), net/mptcp (except simult_flows.sh, mptcp_join.sh), net (except ip_defrag.sh, reuseport_addr_any.sh, udpgro_fwd.sh, gro.sh, txtimestamp.sh, xfrm_policy.sh, udpgso_bench.sh), netfilter (except nft_trans_stress.sh), nsfs, proc, pstore, ptrace, rseq, sgx, sigaltstack, size, splice, static_keys, tc-testing, timens, timers (except raw_skew), tpm2, vm, x86, zram

Reference

kselftests--ciqlts8_6--run1.log
kselftests--ciqlts8_6--run2.log

Patch

kselftests--ciqlts8_6-CVE-2023-5717--run1.log
kselftests--ciqlts8_6-CVE-2023-5717--run2.log
kselftests--ciqlts8_6-CVE-2023-5717--run3.log

Comparison

The patch and reference results are the same.

ktests.xsh diff -d kselftests*.log

Column    File
--------  ---------------------------------------------
Status0   kselftests--ciqlts8_6--run1.log
Status1   kselftests--ciqlts8_6--run2.log
Status2   kselftests--ciqlts8_6-CVE-2023-5717--run1.log
Status3   kselftests--ciqlts8_6-CVE-2023-5717--run2.log
Status4   kselftests--ciqlts8_6-CVE-2023-5717--run3.log

Specific tests: passed

While not strictly testing the provided patch, a very basic sanity check of the perf_events module was done to see if it remains functional.

Reference

$ uname -r 
4.18.0-ciqlts8_6
$ sudo perf stat -B dd if=/dev/zero of=/dev/null count=1000000

1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 1.40199 s, 365 MB/s

 Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':

          1,403.53 msec task-clock                #    0.995 CPUs utilized          
                 3      context-switches          #    2.137 /sec                   
                 1      cpu-migrations            #    0.712 /sec                   
                68      page-faults               #   48.449 /sec                   
     5,967,831,438      cycles                    #    4.252 GHz                    
     2,734,328,407      instructions              #    0.46  insn per cycle         
       582,337,347      branches                  #  414.910 M/sec                  
         8,282,770      branch-misses             #    1.42% of all branches        

       1.411009012 seconds time elapsed

       0.685598000 seconds user
       0.717428000 seconds sys

Patch

$ uname -r 
4.18.0-ciqlts8_6-CVE-2023-5717
$ sudo perf stat -B dd if=/dev/zero of=/dev/null count=1000000

1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 1.38342 s, 370 MB/s

 Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':

          1,385.24 msec task-clock                #    0.990 CPUs utilized          
                 2      context-switches          #    1.444 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
                68      page-faults               #   49.089 /sec                   
     5,954,335,126      cycles                    #    4.298 GHz                    
     2,734,200,695      instructions              #    0.46  insn per cycle         
       582,312,264      branches                  #  420.369 M/sec                  
         8,210,542      branch-misses             #    1.41% of all branches        

       1.398899406 seconds time elapsed

       0.643509000 seconds user
       0.742171000 seconds sys

jira VULN-4127
cve CVE-2023-5717
commit-author Peter Zijlstra <[email protected]>
commit 32671e3
upstream-diff The mainline fix 32671e3
  adds a new `group_generation' field to the `perf_event' struct. This
  breaks LTS 8.6 kABI. The new field was preserved, but moved to the end
  of the struct and wrapped in the `RH_KABI_EXTEND' macro. The kABI in
  this particular case is preserved, as the `perf_event' struct is always
  dynamically allocated through `perf_event_alloc()' and used indirectly
  through a pointer. It's not used as a field in any other struct nor
  as an array element.

Because group consistency is non-atomic between parent (filedesc) and children
(inherited) events, it is possible for PERF_FORMAT_GROUP read() to try and sum
non-matching counter groups -- with non-sensical results.

Add group_generation to distinguish the case where a parent group removes and
adds an event and thus has the same number, but a different configuration of
events as inherited groups.

This became a problem when commit fa8c269 ("perf/core: Invert
perf_read_group() loops") flipped the order of child_list and sibling_list.
Previously it would iterate the group (sibling_list) first, and for each
sibling traverse the child_list. In this order, only the group composition of
the parent is relevant. By flipping the order the group composition of the
child (inherited) events becomes an issue and the mis-match in group
composition becomes evident.

That said; even prior to this commit, while reading of a group that is not
equally inherited was not broken, it still made no sense.

(Ab)use ECHILD as error return to indicate issues with child process group
composition.

Fixes: fa8c269 ("perf/core: Invert perf_read_group() loops")
	Reported-by: Budimir Markovic <[email protected]>
	Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
(cherry picked from commit 32671e3)
	Signed-off-by: Marcin Wcisło <[email protected]>
jira VULN-4127
cve-bf CVE-2023-5717
commit-author Peter Zijlstra <[email protected]>
commit a71ef31

Smatch is awesome.

Fixes: 32671e3 ("perf: Disallow mis-matched inherited group reads")
	Reported-by: Dan Carpenter <[email protected]>
	Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
	Signed-off-by: Ingo Molnar <[email protected]>
(cherry picked from commit a71ef31)
	Signed-off-by: Marcin Wcisło <[email protected]>
Collaborator

@bmastbergen bmastbergen left a comment

🥌

@thefossguy-ciq thefossguy-ciq left a comment

🚤

@PlaidCat PlaidCat merged commit 281dd44 into ctrliq:ciqlts8_6 Aug 13, 2025
2 checks passed