
Commit dceff94

[Hardware][Intel GPU] Add Intel GPU (XPU) inference backend (vllm-project#3814)

jikunshang, bigPYJ1151, and abhilash1910 authored and committed

Co-authored-by: Jiang Li <[email protected]>
Co-authored-by: Abhilash Majumder <[email protected]>
Co-authored-by: Abhilash Majumder <[email protected]>

1 parent: 61f421b

31 files changed: +1998, -24 lines

.buildkite/run-xpu-test.sh

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+# This script builds the XPU docker image and runs offline inference inside the container.
+# It serves as a sanity check for compilation and basic model usage.
+set -ex
+
+# Try building the docker image
+docker build -t xpu-test -f Dockerfile.xpu .
+
+# Set up cleanup
+remove_docker_container() { docker rm -f xpu-test || true; }
+trap remove_docker_container EXIT
+remove_docker_container
+
+# Run the image and launch offline inference
+docker run --network host --name xpu-test --device /dev/dri -v /dev/dri/by-path:/dev/dri/by-path xpu-test python3 examples/offline_inference.py

.buildkite/test-template.j2

Lines changed: 5 additions & 0 deletions
@@ -45,6 +45,11 @@ steps:
     queue: intel
   command: bash .buildkite/run-cpu-test.sh
 
+- label: "XPU Test"
+  agents:
+    queue: intel
+  command: bash .buildkite/run-xpu-test.sh
+
 {% for step in steps %}
 - label: "{{ step.label }}"
   agents:

Dockerfile.xpu

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+FROM intel/oneapi-basekit:2024.1.0-devel-ubuntu22.04
+
+RUN wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | tee /usr/share/keyrings/intel-oneapi-archive-keyring.gpg > /dev/null && \
+    echo "deb [signed-by=/usr/share/keyrings/intel-oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main " | tee /etc/apt/sources.list.d/oneAPI.list && \
+    chmod 644 /usr/share/keyrings/intel-oneapi-archive-keyring.gpg && \
+    rm /etc/apt/sources.list.d/intel-graphics.list && \
+    wget -O- https://repositories.intel.com/graphics/intel-graphics.key | gpg --dearmor | tee /usr/share/keyrings/intel-graphics.gpg > /dev/null && \
+    echo "deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/graphics/ubuntu jammy arc" | tee /etc/apt/sources.list.d/intel.gpu.jammy.list && \
+    chmod 644 /usr/share/keyrings/intel-graphics.gpg
+
+RUN apt-get update -y \
+    && apt-get install -y curl libicu70 lsb-release git wget vim numactl python3 python3-pip
+
+COPY ./ /workspace/vllm
+
+WORKDIR /workspace/vllm
+
+RUN pip install -v -r requirements-xpu.txt
+
+RUN VLLM_TARGET_DEVICE=xpu python3 setup.py install
+
+CMD ["/bin/bash"]

benchmarks/benchmark_latency.py

Lines changed: 1 addition & 1 deletion
@@ -191,7 +191,7 @@ def run_to_completion(profile_dir: Optional[str] = None):
         "--device",
         type=str,
         default="cuda",
-        choices=["cuda", "cpu", "tpu"],
+        choices=["cuda", "cpu", "tpu", "xpu"],
         help='device type for vLLM execution, supporting CUDA and CPU.')
     parser.add_argument('--block-size',
                         type=int,

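With "xpu" accepted as a device choice, the latency benchmark can target an Intel GPU directly. An illustrative invocation (the model name is a placeholder):

    $ python3 benchmarks/benchmark_latency.py --model facebook/opt-125m --device xpu
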
benchmarks/benchmark_throughput.py

Lines changed: 1 addition & 1 deletion
@@ -349,7 +349,7 @@ def main(args: argparse.Namespace):
         "--device",
         type=str,
         default="cuda",
-        choices=["cuda", "cpu", "tpu"],
+        choices=["cuda", "cpu", "tpu", "xpu"],
         help='device type for vLLM execution, supporting CUDA and CPU.')
     parser.add_argument(
         "--enable-prefix-caching",

docs/source/getting_started/xpu-installation.rst

Lines changed: 61 additions & 0 deletions

@@ -0,0 +1,61 @@
+.. _installation_xpu:
+
+Installation with XPU
+========================
+
+vLLM initially supports basic model inference and serving on the Intel GPU platform.
+
+Table of contents:
+
+#. :ref:`Requirements <xpu_backend_requirements>`
+#. :ref:`Quick start using Dockerfile <xpu_backend_quick_start_dockerfile>`
+#. :ref:`Build from source <build_xpu_backend_from_source>`
+
+.. _xpu_backend_requirements:
+
+Requirements
+------------
+
+* OS: Linux
+* Supported Hardware: Intel Data Center GPU (Intel Arc GPU support is a work in progress)
+* oneAPI requirements: oneAPI 2024.1
+
+.. _xpu_backend_quick_start_dockerfile:
+
+Quick start using Dockerfile
+----------------------------
+
+.. code-block:: console
+
+   $ docker build -f Dockerfile.xpu -t vllm-xpu-env --shm-size=4g .
+   $ docker run -it \
+                --rm \
+                --network=host \
+                --device /dev/dri \
+                -v /dev/dri/by-path:/dev/dri/by-path \
+                vllm-xpu-env
+
+.. _build_xpu_backend_from_source:
+
+Build from source
+-----------------
+
+- First, install the required driver and Intel oneAPI 2024.1.
+
+- Second, install the Python packages needed to build the vLLM XPU backend:
+
+  .. code-block:: console
+
+     $ pip install --upgrade pip
+     $ pip install -v -r requirements-xpu.txt
+
+- Finally, build and install the vLLM XPU backend:
+
+  .. code-block:: console
+
+     $ VLLM_TARGET_DEVICE=xpu python setup.py install
+
+.. note::
+   - FP16 is the default data type in the current XPU backend. The BF16 data
+     type will be supported in the future.

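Once installed, inference follows the usual vLLM offline API. A minimal sketch mirroring examples/offline_inference.py (the script the CI container runs); the model name is an illustrative placeholder, and an XPU-only build selects the XPU backend automatically:

    from vllm import LLM, SamplingParams

    prompts = ["Hello, my name is", "The future of AI is"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # Model name is a placeholder; any supported Hugging Face model works.
    llm = LLM(model="facebook/opt-125m")
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        print(f"Prompt: {output.prompt!r} -> {output.outputs[0].text!r}")
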
docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -66,6 +66,7 @@ Documentation
    getting_started/cpu-installation
    getting_started/neuron-installation
    getting_started/tpu-installation
+   getting_started/xpu-installation
    getting_started/quickstart
    getting_started/debugging
    getting_started/examples/examples_index

requirements-xpu.txt

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+# Common dependencies
+-r requirements-common.txt
+
+setuptools < 70.0.0  # IPEX's torch has a dependency on this pin; to be removed.
+
+torch @ https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_dev/xpu/torch-2.1.0.post1%2Bcxx11.abi-cp310-cp310-linux_x86_64.whl
+intel_extension_for_pytorch @ https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_dev/xpu/intel_extension_for_pytorch-2.1.30a0-cp310-cp310-linux_x86_64.whl
+oneccl_bind_pt @ https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/oneccl_bind_pt-2.1.200%2Bxpu-cp310-cp310-linux_x86_64.whl
+
+triton @ https://github.com/intel/intel-xpu-backend-for-triton/releases/download/v2.1.0/triton-2.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

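To verify the pinned stack after installation, a quick sanity check (assuming the torch.xpu namespace that the IPEX XPU build registers):

    import torch
    # Importing IPEX registers the 'xpu' device type with PyTorch.
    import intel_extension_for_pytorch as ipex

    print("torch:", torch.__version__, "ipex:", ipex.__version__)
    print("XPU available:", torch.xpu.is_available())
    print("XPU device count:", torch.xpu.device_count())
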
setup.py

Lines changed: 8 additions & 0 deletions
@@ -234,6 +234,10 @@ def _is_cpu() -> bool:
     return VLLM_TARGET_DEVICE == "cpu"
 
 
+def _is_xpu() -> bool:
+    return VLLM_TARGET_DEVICE == "xpu"
+
+
 def _build_custom_ops() -> bool:
     return _is_cuda() or _is_hip() or _is_cpu()
 
@@ -357,6 +361,8 @@ def get_vllm_version() -> str:
         version += "+tpu"
     elif _is_cpu():
         version += "+cpu"
+    elif _is_xpu():
+        version += "+xpu"
     else:
         raise RuntimeError("Unknown runtime environment")
 
@@ -406,6 +412,8 @@ def _read_requirements(filename: str) -> List[str]:
         requirements = _read_requirements("requirements-tpu.txt")
     elif _is_cpu():
         requirements = _read_requirements("requirements-cpu.txt")
+    elif _is_xpu():
+        requirements = _read_requirements("requirements-xpu.txt")
     else:
         raise ValueError(
             "Unsupported platform, please use CUDA, ROCm, Neuron, or CPU.")

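With the version suffix wired in, an XPU build identifies itself at runtime. An illustrative check (the base version number is a placeholder):

    $ python3 -c "import vllm; print(vllm.__version__)"
    0.x.y+xpu
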
vllm/_custom_ops.py

Lines changed: 2 additions & 1 deletion
@@ -373,7 +373,8 @@ def reshape_and_cache_flash(
                                                    kv_cache_dtype)
 
 
-def copy_blocks(key_caches: torch.Tensor, value_caches: torch.Tensor,
+def copy_blocks(key_caches: List[torch.Tensor],
+                value_caches: List[torch.Tensor],
                 block_mapping: torch.Tensor) -> None:
     torch.ops._C_cache_ops.copy_blocks(key_caches, value_caches, block_mapping)

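The annotation fix reflects that the cache op takes one tensor per layer, not a single tensor. A shape-only sketch of the arguments (sizes are illustrative, and actually dispatching copy_blocks requires vLLM's compiled _C_cache_ops extension):

    import torch

    num_layers, num_blocks, block_numel = 2, 8, 16 * 128  # illustrative sizes
    key_caches = [torch.empty(num_blocks, block_numel) for _ in range(num_layers)]
    value_caches = [torch.empty(num_blocks, block_numel) for _ in range(num_layers)]
    # Each row pairs a source block index with a destination block index.
    block_mapping = torch.tensor([[0, 4], [1, 5]], dtype=torch.int64)
    # copy_blocks(key_caches, value_caches, block_mapping) would then copy those
    # blocks in every layer's KV cache.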