A Vulkan 1.2 application to experiment with preemption and core isolation.
To ensure a smooth user experience in multi-tasking systems where multiple applications and services share the same GPU resources, lower-priority GPU work sometimes has to be preempted by more crucial tasks like compositor work.
While preemption is designed to prioritize critical tasks, a lack of efficient preemption can lead to significant delays, preventing high-priority workloads from executing in a timely manner.
`vkpreempt` is designed to experiment with and understand GPU preemption and its effects, as well as to explore the pros and cons of reserving GPU resources for higher-priority tasks.

`vkpreempt` runs graphics or compute GPU tasks at regular intervals. Each iteration of GPU work is aligned to multiples of the interval since the system clock's epoch, such that two concurrent executions with the same arguments schedule their workloads to run at roughly the same time.
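The alignment scheme can be sketched in C++ as follows. This is a minimal illustration; the helper name and the round-up behavior are assumptions for the sketch, not `vkpreempt`'s actual code:

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: given the current time since the system clock's epoch,
// compute the next deadline that is a multiple of the interval (plus an
// optional offset). Two processes calling this with the same interval arrive
// at the same deadline, which is what lets concurrent executions align.
std::chrono::nanoseconds next_aligned_deadline(std::chrono::nanoseconds now,
                                               std::chrono::nanoseconds interval,
                                               std::chrono::nanoseconds offset) {
    // Round the current time up to the next interval boundary.
    auto ticks = (now.count() + interval.count() - 1) / interval.count();
    return std::chrono::nanoseconds{ticks * interval.count()} + offset;
}
```

With a 16 ms interval, a process waking at t = 33 ms and another waking at t = 47 ms both compute the same 48 ms deadline.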
The following dependencies are downloaded, built and linked automatically by CMake:
- argparse (v3.2) for parsing command line arguments. Published under MIT license.
- Perfetto (v51.2) for emitting GPU task submissions and executions as Perfetto trace events (optional). Only used if `vkpreempt` is built with the CMake option `VKPREEMPT_ENABLE_PERFETTO_TRACES` enabled. Published under Apache 2.0 license.
- glfw (3.4) for rendering into a native window instead of running in headless mode (optional). Only used if `vkpreempt` is built with the CMake options `VKPREEMPT_ENABLE_SURFACE` and `VKPREEMPT_USE_NATIVE_WINDOW` enabled. Published under Zlib license.
```sh
cmake --preset default
cmake --build build
```
The executable `vkpreempt` supports two subcommands to run graphics and compute tasks, respectively.
Runs a graphics task rendering multiple octaves of noise in a full-screen 2D grid of n x n cells.
The graphics work can be scaled on three orthogonal dimensions:
- Vertex shader invocations: The sample renders a full-screen grid of n x n cells. The number of vertex shader invocations is increased by increasing the number of cells per row n.
- Fragment shader invocations: The number of fragment shader invocations is controlled by the output resolution.
- Fragment shader load: Each fragment shader invocation computes m octaves of noise. To increase the workload, increase the number of octaves.
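As a back-of-the-envelope sketch of how the knobs scale the work (assuming each grid cell is drawn as two triangles without index reuse — an assumption for illustration; the actual mesh layout may differ):

```cpp
#include <cstdint>

// Rough invocation counts for the scaling dimensions described above.
uint64_t vertex_invocations(uint64_t cells_per_row) {
    // n x n cells, two triangles per cell, three vertices per triangle.
    return cells_per_row * cells_per_row * 6;
}

uint64_t fragment_invocations(uint64_t width, uint64_t height) {
    // One invocation per covered pixel for a full-screen grid.
    return width * height;
}
```

With the defaults (16 cells, 800x600), that is 1,536 vertex invocations and 480,000 fragment invocations per frame.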
Frames are scheduled to be rendered at regular intervals. Each frame is aligned to multiples of the interval since the system clock's epoch such that two concurrent executions with the same arguments schedule frames to be rendered at the same time.
```
Usage: vkpreempt graphics [--help] [--version] [--width VAR] [--height VAR] [--cells VAR] [--loops VAR] [--interval VAR] [--offset VAR] [--global-priority] [--cpu VAR]

Optional arguments:
  -h, --help             shows help message and exits
  -v, --version          prints version information and exits
  -w, --width            the width of the window [nargs=0..1] [default: 800]
  -H, --height           the height of the window [nargs=0..1] [default: 600]
  -c, --cells            the number of grid cells per row and column (cells x cells) - increase to scale up the number of vertex shader invocations [nargs=0..1] [default: 16]
  -l, --loops            the number of loop iterations to execute in the fragment shader - increase to scale up the fragment workload [nargs=0..1] [default: 1]
  -i, --interval         the interval in milliseconds to schedule and align each frame with [nargs=0..1] [default: 16]
  -o, --offset           the offset in nanoseconds from the scheduling interval [nargs=0..1] [default: 0]
  -g, --global-priority  if this flag is set, the sample's GPU queue is created with the maximum system-wide global priority
  --cpu                  the index of the CPU core to pin this sample to (ignored if not supported)
```
Runs a compute task operating on n elements.
The compute work can be scaled on two orthogonal dimensions:
- Compute shader load: Each compute shader invocation computes m octaves of noise. To increase the workload, increase the number of octaves.
- Compute shader invocations: The number of compute shader invocations is controlled by the element count and the workgroup size.
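The relationship between element count, workgroup size, and dispatched workgroups can be sketched as a rounded-up division (a sketch; how `vkpreempt` handles a non-divisible remainder is an assumption here):

```cpp
#include <cstdint>

// Number of workgroups dispatched for a 1D compute task: the element count
// divided by the workgroup size, rounded up so every element is covered.
uint32_t workgroup_count(uint32_t num_elements, uint32_t workgroup_size) {
    return (num_elements + workgroup_size - 1) / workgroup_size;
}
```

With the defaults (480,000 elements, workgroup size 256), this yields exactly 1,875 workgroups.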
Compute tasks are scheduled to be executed at regular intervals. Each compute job is aligned to multiples of the interval since the system clock's epoch such that two concurrent executions with the same arguments schedule their jobs to be run at the same time.
```
Usage: vkpreempt compute [--help] [--version] [--num-elements VAR] [--workgroup-size VAR] [--loops VAR] [--interval VAR] [--offset VAR] [--global-priority] [--cpu VAR]

Optional arguments:
  -h, --help             shows help message and exits
  -v, --version          prints version information and exits
  -n, --num-elements     the number of elements to operate on - increase to scale up the number of compute shader invocations [nargs=0..1] [default: 480000]
  -w, --workgroup-size   the size of each workgroup [nargs=0..1] [default: 256]
  -l, --loops            the number of loop iterations to execute in the compute shader - increase to scale up the compute workload [nargs=0..1] [default: 1]
  -i, --interval         the interval in milliseconds to schedule and align each compute job with [nargs=0..1] [default: 16]
  -o, --offset           the offset in nanoseconds from the scheduling interval [nargs=0..1] [default: 0]
  -g, --global-priority  if this flag is set, the sample's GPU queue is created with the maximum system-wide global priority
  --cpu                  the index of the CPU core to pin this sample to (ignored if not supported)
```
For a GPU workload to preempt another, it needs a higher priority. The Vulkan way of achieving this is to submit it to a queue whose system-wide global priority is higher than that of the queue the other workload was submitted to. The device extension for querying the system-scoped queue priorities available on a physical device, and for creating a queue with a certain system-wide priority, is `VK_KHR_global_priority` (note: this extension has been promoted to core in Vulkan 1.4). The sample application simply chooses the highest global priority available if the corresponding command line option is given.
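A minimal sketch of how such a queue is requested (assuming `VK_KHR_global_priority` is enabled on the device and the chosen queue family supports the requested priority — query that beforehand by chaining `VkQueueFamilyGlobalPriorityPropertiesKHR` into `vkGetPhysicalDeviceQueueFamilyProperties2`):

```cpp
#include <vulkan/vulkan.h>

// Request a system-wide global priority for a queue by chaining
// VkDeviceQueueGlobalPriorityCreateInfoKHR into VkDeviceQueueCreateInfo.
VkDeviceQueueGlobalPriorityCreateInfoKHR globalPriorityInfo{};
globalPriorityInfo.sType =
    VK_STRUCTURE_TYPE_DEVICE_QUEUE_GLOBAL_PRIORITY_CREATE_INFO_KHR;
globalPriorityInfo.globalPriority = VK_QUEUE_GLOBAL_PRIORITY_REALTIME_KHR;

float queuePriority = 1.0f;  // per-device priority, separate from the global one
VkDeviceQueueCreateInfo queueInfo{};
queueInfo.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
queueInfo.pNext = &globalPriorityInfo;  // chain the global priority request
queueInfo.queueFamilyIndex = 0;         // illustrative; pick a suitable family
queueInfo.queueCount = 1;
queueInfo.pQueuePriorities = &queuePriority;
```

Note that device creation can fail with `VK_ERROR_NOT_PERMITTED_KHR` if the process lacks the privileges for the requested global priority.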
Some ARM Mali GPUs support core isolation on a software level. The idea is to partition GPU cores or groups of cores to submit work to dedicated partitions. A possible use case of this feature is to reserve GPU resources for high-priority tasks while low-priority tasks can run on the remaining cores to avoid preemption delays.
Mesa's Panfrost driver exposes a way to configure a process with a particular core mask via driconf. With that, the environment variables `pan_fragment_core_mask` and `pan_compute_core_mask` can be used to enable or disable GPU cores.
While applications can not explicitly submit work to cores or groups of cores on an API level, this makes it possible to specify core partitions available to an application during its whole runtime.
We're using Perfetto traces to investigate the effects of core isolation with PanVK. To get started with tracing on PanVK, make yourself familiar with Mesa's guide.
To start tracing with GPU counters enabled:

- Start `pps-producer` (for GPU counters)
- Start the executable with `MESA_GPU_TRACES=perfetto` and a core mask using `pan_fragment_core_mask=<mask>` and `pan_compute_core_mask=<mask>`
- Start tracing
In our experiments, we use a Radxa ROCK 5 Model B with a 4-core Mali G610. The allowed values for PanVK's core mask in this configuration are:
- `0x00001` (Core 0)
- `0x00004` (Core 2)
- `0x10000` (Core 16)
- `0x40000` (Core 18)
- any combination of the above (e.g., `0x40005` for cores 0, 2, and 18)
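A hypothetical validity check for such a mask (the allowed bits are taken from the list above; the function is an illustration, not part of PanVK):

```cpp
#include <cstdint>

// On the ROCK 5B's Mali G610, shader cores sit at bits 0, 2, 16 and 18.
// A mask is usable iff it is non-zero and only sets allowed core bits.
bool is_valid_core_mask(uint32_t mask) {
    constexpr uint32_t kAllowedCores = 0x00001 | 0x00004 | 0x10000 | 0x40000;
    return mask != 0 && (mask & ~kAllowedCores) == 0;
}
```

For example, `0x40005` passes the check, while `0x00002` (bit 1, no core present) does not.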