GPU: Add support for the new xe KMD #1670

Merged · 6 commits · Mar 13, 2024

Changes from all commits

1 change: 1 addition & 0 deletions .github/workflows/lib-e2e.yaml
@@ -25,6 +25,7 @@ jobs:
- name: e2e-gpu
runner: gpu
images: intel-gpu-plugin intel-gpu-initcontainer
targetjob: e2e-gpu SKIP=Resource:xe
- name: e2e-iaa-spr
targetjob: e2e-iaa
runner: simics-spr
2 changes: 1 addition & 1 deletion README.md
@@ -229,7 +229,7 @@ The summary of resources available via plugins in this repository is given in th
* [dsa-accel-config-demo-pod.yaml](demo/dsa-accel-config-demo-pod.yaml)
* `fpga.intel.com` : custom, see [mappings](cmd/fpga_admissionwebhook/README.md#mappings)
* [intelfpga-job.yaml](demo/intelfpga-job.yaml)
-* `gpu.intel.com` : `i915`
+* `gpu.intel.com` : `i915`, `i915_monitoring`, `xe` or `xe_monitoring`
* [intelgpu-job.yaml](demo/intelgpu-job.yaml)
* `iaa.intel.com` : `wq-user-[shared or dedicated]`
* [iaa-accel-config-demo-pod.yaml](demo/iaa-accel-config-demo-pod.yaml)
40 changes: 39 additions & 1 deletion cmd/gpu_plugin/README.md
@@ -16,6 +16,7 @@ Table of Contents
* [Running GPU plugin as non-root](#running-gpu-plugin-as-non-root)
* [Labels created by GPU plugin](#labels-created-by-gpu-plugin)
* [SR-IOV use with the plugin](#sr-iov-use-with-the-plugin)
* [KMD and UMD](#kmd-and-umd)
* [Issues with media workloads on multi-GPU setups](#issues-with-media-workloads-on-multi-gpu-setups)
* [Workaround for QSV and VA-API](#workaround-for-qsv-and-va-api)

@@ -36,11 +37,23 @@ For example containers with Intel media driver (and components using that), can
video transcoding operations, and containers with the Intel OpenCL / oneAPI Level Zero
backend libraries can offload compute operations to GPU.

The Intel GPU plugin can register four node resources with the Kubernetes cluster:

| Resource | Description |
|:---- |:-------- |
| gpu.intel.com/i915 | GPU instance running the legacy `i915` KMD |
| gpu.intel.com/i915_monitoring | Monitoring resource for the legacy `i915` KMD devices |
| gpu.intel.com/xe | GPU instance running the new `xe` KMD |
| gpu.intel.com/xe_monitoring | Monitoring resource for the new `xe` KMD devices |

While the GPU plugin's basic operations support nodes that have both `i915` and `xe` KMDs present, its resource management (GAS) does not; for GAS, a node must have only one of the KMDs.

For workloads on different KMDs, see [KMD and UMD](#kmd-and-umd).
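
As an illustration, a minimal client-go sketch of a pod that requests one `gpu.intel.com/xe` device; the pod name, container name and image are hypothetical, and a plain pod YAML (such as the demo intelgpu-job.yaml) achieves the same:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Request exactly one xe-backed GPU instance for the container.
	// Extended resources are set as limits; requests default to the limits.
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "xe-workload"}, // hypothetical name
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "compute",                 // hypothetical name
				Image: "my-compute-image:latest", // hypothetical image
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						"gpu.intel.com/xe": resource.MustParse("1"),
					},
				},
			}},
		},
	}

	fmt.Println(pod.Spec.Containers[0].Resources.Limits)
}
```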

## Modes and Configuration Options

| Flag | Argument | Default | Meaning |
|:---- |:-------- |:------- |:------- |
-| -enable-monitoring | - | disabled | Enable 'i915_monitoring' resource that provides access to all Intel GPU devices on the node |
+| -enable-monitoring | - | disabled | Enable '*_monitoring' resource that provides access to all Intel GPU devices on the node, [see use](./monitoring.md) |
| -resource-manager | - | disabled | Enable fractional resource management, [see use](./fractional.md) |
| -shared-dev-num | int | 1 | Number of containers that can share the same GPU device |
| -allocation-policy | string | none | 3 possible values: balanced, packed, none. For shared-dev-num > 1: _balanced_ mode spreads workloads among GPU devices, _packed_ mode fills one GPU fully before moving to next, and _none_ selects first available device from kubelet. Default is _none_. Allocation policy does not have an effect when resource manager is enabled. |
@@ -205,6 +218,31 @@ GPU plugin does __not__ set up SR-IOV. It has to be configured by the cluster admin.

The GPU plugin does, however, support provisioning Virtual Functions (VFs) to containers for an SR-IOV enabled GPU. When the plugin detects a GPU with SR-IOV VFs configured, it provisions only the VFs and leaves the PF device on the host.
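
As a sketch of how such detection can work (assuming the standard `sriov_numvfs` sysfs attribute; the helper name is hypothetical and this is not necessarily the plugin's exact logic):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// isPFWithVFs reports whether the PCI device behind a DRM card is an SR-IOV
// physical function with VFs enabled: the sriov_numvfs attribute exists only
// for PFs, and holds the number of currently enabled VFs.
func isPFWithVFs(cardPath string) bool {
	data, err := os.ReadFile(filepath.Join(cardPath, "device", "sriov_numvfs"))
	if err != nil {
		return false // attribute missing: not an SR-IOV PF
	}

	numVFs, err := strconv.Atoi(strings.TrimSpace(string(data)))

	return err == nil && numVFs > 0
}

func main() {
	fmt.Println(isPFWithVFs("/sys/class/drm/card0"))
}
```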

### KMD and UMD

There are three different Kernel Mode Drivers (KMDs) available: `i915 upstream`, `i915 backport` and `xe`:
* `i915 upstream` is the vanilla driver that comes with the upstream kernel and is included in common Linux distributions, such as Ubuntu.
* `i915 backport` is an [out-of-tree driver](https://github.com/intel-gpu/intel-gpu-i915-backports/) for older enterprise / LTS kernel versions, with better support for new hardware before the upstream kernel gains it. The API it provides to user space can differ from the eventual upstream version.
* `xe` is a new KMD intended to support future GPUs. While it has [experimental support for current GPUs](https://docs.kernel.org/gpu/rfc/xe.html) (starting from Tiger Lake), it will not support them officially.
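
To check which KMD a given card is running, read the card's `device/driver` symlink in sysfs; its basename is the bound kernel module. A standalone Go sketch (not part of the plugin, though the plugin's `pluginutils.ReadDeviceDriver` serves a similar purpose):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Each /sys/class/drm/cardN directory has a "device/driver" symlink
	// whose basename is the bound KMD, e.g. "i915" or "xe".
	cards, err := filepath.Glob("/sys/class/drm/card[0-9]*")
	if err != nil {
		return
	}

	for _, card := range cards {
		link, err := os.Readlink(filepath.Join(card, "device", "driver"))
		if err != nil {
			continue // no driver bound to this card
		}

		fmt.Printf("%s: %s\n", filepath.Base(card), filepath.Base(link))
	}
}
```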

For optimal performance, the KMD should be paired with the matching UMD variant. When creating a workload container, select the UMD packages appropriate for the target hardware.

| KMD | UMD packages | Support notes |
|:---- |:-------- |:------- |
| `i915 upstream` | Distro Repository | For integrated GPUs. Newer Linux kernels will introduce support for the Arc, Flex and Max series. |
| `i915 backport` | [Intel Repository](https://dgpu-docs.intel.com/driver/installation.html#install-steps) | Best for Arc, Flex and Max series. Untested for Integrated GPUs. |
| `xe` | Source code only | Experimental support for Arc, Flex and Max series. |

> *NOTE*: Xe UMD is in active development and should be considered experimental.

Creating a single workload that supports all the different KMDs is not currently possible. The table below clarifies how each domain supports the different KMDs.

| Domain | i915 upstream | i915 backport | xe | Notes |
|:---- |:-------- |:------- |:------- |:------- |
| Compute | Default | [NEO_ENABLE_i915_PRELIM_DETECTION](https://github.com/intel/compute-runtime/blob/3341de7a0d5fddd2ea5f505b5d2ef5c13faa0681/CMakeLists.txt#L496-L502) | [NEO_ENABLE_XE_DRM_DETECTION](https://github.com/intel/compute-runtime/blob/3341de7a0d5fddd2ea5f505b5d2ef5c13faa0681/CMakeLists.txt#L504-L510) | All three KMDs can be supported at the same time. |
| Media | Default | [ENABLE_PRODUCTION_KMD](https://github.com/intel/media-driver/blob/a66b076e83876fbfa9c9ab633ad9c5517f8d74fd/CMakeLists.txt#L58) | [ENABLE_XE_KMD](https://github.com/intel/media-driver/blob/a66b076e83876fbfa9c9ab633ad9c5517f8d74fd/media_driver/cmake/linux/media_feature_flags_linux.cmake#L187-L190) | Xe with upstream or backport i915, not all three. |
| Graphics | Default | Unknown | [intel-xe-kmd](https://gitlab.freedesktop.org/mesa/mesa/-/blob/e9169881dbd1f72eab65a68c2b8e7643f74489b7/meson_options.txt#L708) | i915 and xe KMDs can be supported at the same time. |

### Issues with media workloads on multi-GPU setups

OneVPL media API, 3D and compute APIs provide device discovery
85 changes: 85 additions & 0 deletions cmd/gpu_plugin/device_props.go
@@ -0,0 +1,85 @@
// Copyright 2024 Intel Corporation. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package main

import (
	"slices"

	"github.com/intel/intel-device-plugins-for-kubernetes/cmd/internal/labeler"
	"github.com/intel/intel-device-plugins-for-kubernetes/cmd/internal/pluginutils"
	"k8s.io/klog/v2"
)

// DeviceProperties accumulates per-card details while the plugin scans
// the node's DRM devices.
type DeviceProperties struct {
	currentDriver string          // KMD of the most recently fetched card
	drmDrivers    map[string]bool // set of KMDs seen on this node
	tileCounts    []uint64        // tile count of each scanned card
	isPfWithVfs   bool            // latest card is an SR-IOV PF with VFs enabled
}

// invalidTileCountErr is returned when tile counts are missing or inconsistent.
type invalidTileCountErr struct {
	error
}

func newDeviceProperties() *DeviceProperties {
	return &DeviceProperties{
		drmDrivers: make(map[string]bool),
	}
}

// fetch reads the SR-IOV state, tile count and driver name for the card
// at cardPath and records them into the aggregate properties.
func (d *DeviceProperties) fetch(cardPath string) {
	d.isPfWithVfs = pluginutils.IsSriovPFwithVFs(cardPath)

	d.tileCounts = append(d.tileCounts, labeler.GetTileCount(cardPath))

	driverName, err := pluginutils.ReadDeviceDriver(cardPath)
	if err != nil {
		klog.Warningf("card (%s) doesn't have a driver, using default: %s", cardPath, deviceTypeDefault)

		driverName = deviceTypeDefault
	}

	d.currentDriver = driverName
	d.drmDrivers[d.currentDriver] = true
}

func (d *DeviceProperties) drmDriverCount() int {
	return len(d.drmDrivers)
}

func (d *DeviceProperties) driver() string {
	return d.currentDriver
}

func (d *DeviceProperties) monitorResource() string {
	return d.currentDriver + monitorSuffix
}

// maxTileCount returns the common tile count of the node's GPUs, or an
// error when no cards have been scanned or the GPUs are heterogeneous.
func (d *DeviceProperties) maxTileCount() (uint64, error) {
	if len(d.tileCounts) == 0 {
		return 0, invalidTileCountErr{}
	}

	minCount := slices.Min(d.tileCounts)
	maxCount := slices.Max(d.tileCounts)

	if minCount != maxCount {
		klog.Warningf("Node's GPUs are heterogeneous (min: %d, max: %d tiles)", minCount, maxCount)

		return 0, invalidTileCountErr{}
	}

	return maxCount, nil
}
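
For context, a hypothetical sketch of how these helpers could be driven while scanning a node's cards; the function and card paths are illustrative, not the plugin's actual scan code:

```go
// scanExample would live in the same package as device_props.go above.
func scanExample(cardPaths []string) {
	props := newDeviceProperties()

	for _, card := range cardPaths {
		props.fetch(card)
		klog.Infof("%s -> gpu.intel.com/%s (monitor: gpu.intel.com/%s)",
			card, props.driver(), props.monitorResource())
	}

	if props.drmDriverCount() > 1 {
		klog.Warning("both i915 and xe devices present; GAS resource management is unsupported")
	}

	if tiles, err := props.maxTileCount(); err == nil {
		klog.Infof("homogeneous GPUs with %d tile(s) each", tiles)
	}
}
```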