Commit fe2e10c

Add example of helm chart for vllm deployment on k8s (#9199)

Signed-off-by: Maxime Fournioux <[email protected]>

1 parent 82c73fd

20 files changed: +1206 -0 lines
Lines changed: 81 additions & 0 deletions

name: Lint and Deploy Charts

on: pull_request

jobs:
  lint-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0

      - name: Set up Helm
        uses: azure/setup-helm@fe7b79cd5ee1e45176fcad797de68ecaf3ca4814 # v4.2.0
        with:
          version: v3.14.4

      # Python is required because ct lint runs Yamale and yamllint which require Python.
      - uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
        with:
          python-version: '3.13'

      - name: Set up chart-testing
        uses: helm/chart-testing-action@e6669bcd63d7cb57cb4380c33043eebe5d111992 # v2.6.1
        with:
          version: v3.10.1

      - name: Run chart-testing (lint)
        run: ct lint --target-branch ${{ github.event.repository.default_branch }} --chart-dirs examples/chart-helm --charts examples/chart-helm

      - name: Setup minio
        run: |
          docker network create vllm-net
          docker run -d -p 9000:9000 --name minio --net vllm-net \
            -e "MINIO_ACCESS_KEY=minioadmin" \
            -e "MINIO_SECRET_KEY=minioadmin" \
            -v /tmp/data:/data \
            -v /tmp/config:/root/.minio \
            minio/minio server /data
          export AWS_ACCESS_KEY_ID=minioadmin
          export AWS_SECRET_ACCESS_KEY=minioadmin
          export AWS_EC2_METADATA_DISABLED=true
          mkdir opt-125m
          cd opt-125m && curl -O -Ls "https://huggingface.co/facebook/opt-125m/resolve/main/{pytorch_model.bin,config.json,generation_config.json,merges.txt,special_tokens_map.json,tokenizer_config.json,vocab.json}" && cd ..
          aws --endpoint-url http://127.0.0.1:9000/ s3 mb s3://testbucket
          aws --endpoint-url http://127.0.0.1:9000/ s3 cp opt-125m/ s3://testbucket/opt-125m --recursive

      - name: Create kind cluster
        uses: helm/kind-action@0025e74a8c7512023d06dc019c617aa3cf561fde # v1.10.0

      - name: Build the Docker image vllm cpu
        run: docker buildx build -f Dockerfile.cpu -t vllm-cpu-env .

      - name: Configuration of docker images, network and namespace for the kind cluster
        run: |
          docker pull amazon/aws-cli:2.6.4
          kind load docker-image amazon/aws-cli:2.6.4 --name chart-testing
          kind load docker-image vllm-cpu-env:latest --name chart-testing
          docker network connect vllm-net "$(docker ps -aqf "name=chart-testing-control-plane")"
          kubectl create ns ns-vllm

      - name: Run chart-testing (install)
        run: |
          export AWS_ACCESS_KEY_ID=minioadmin
          export AWS_SECRET_ACCESS_KEY=minioadmin
          helm install --wait --wait-for-jobs --timeout 5m0s --debug --create-namespace --namespace=ns-vllm test-vllm examples/chart-helm -f examples/chart-helm/values.yaml --set secrets.s3endpoint=http://minio:9000 --set secrets.s3bucketname=testbucket --set secrets.s3accesskeyid=$AWS_ACCESS_KEY_ID --set secrets.s3accesskey=$AWS_SECRET_ACCESS_KEY --set resources.requests.cpu=1 --set resources.requests.memory=4Gi --set resources.limits.cpu=2 --set resources.limits.memory=5Gi --set image.env[0].name=VLLM_CPU_KVCACHE_SPACE --set image.env[1].name=VLLM_LOGGING_LEVEL --set-string image.env[0].value="1" --set-string image.env[1].value="DEBUG" --set-string extraInit.s3modelpath="opt-125m/" --set-string 'resources.limits.nvidia\.com/gpu=0' --set-string 'resources.requests.nvidia\.com/gpu=0' --set-string image.repository="vllm-cpu-env"

      - name: curl test
        run: |
          kubectl -n ns-vllm port-forward service/test-vllm-service 8001:80 &
          sleep 10
          CODE="$(curl -v -f --location http://localhost:8001/v1/completions \
            --header "Content-Type: application/json" \
            --data '{
              "model": "opt-125m",
              "prompt": "San Francisco is a",
              "max_tokens": 7,
              "temperature": 0
            }'):$CODE"
          echo "$CODE"
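The workflow's final step issues the same request that any client would. As an illustrative sketch (assuming, as in the step above, that the service is port-forwarded to localhost:8001; the helper names are not part of the workflow), the payload can be built and the response parsed in Python:

```python
import json
import urllib.request

# Matches the port-forward set up in the workflow's "curl test" step.
ENDPOINT = "http://localhost:8001/v1/completions"

def build_completion_payload(model: str, prompt: str,
                             max_tokens: int = 7,
                             temperature: float = 0.0) -> str:
    """Serialize the OpenAI-compatible /v1/completions request body."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    })

def query_completions(payload: str, endpoint: str = ENDPOINT) -> dict:
    """POST the payload to the server; requires the deployment to be running."""
    req = urllib.request.Request(
        endpoint,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_completion_payload("opt-125m", "San Francisco is a")
# For an OpenAI-compatible server, query_completions(payload) returns a dict
# whose generated text is under response["choices"][0]["text"].
```

`query_completions` is deliberately left uncalled here, since it needs the live deployment; the payload mirrors the `--data` body of the curl test exactly.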

docs/source/index.rst

Lines changed: 1 addition & 0 deletions

@@ -82,6 +82,7 @@ Documentation
    serving/openai_compatible_server
    serving/deploying_with_docker
    serving/deploying_with_k8s
+   serving/deploying_with_helm
    serving/deploying_with_nginx
    serving/distributed_serving
    serving/metrics
Binary image file (968 KB); preview not shown.
Lines changed: 253 additions & 0 deletions

.. _deploying_with_helm:

Deploying with Helm
===================

A Helm chart to deploy vLLM for Kubernetes.

Helm is a package manager for Kubernetes. It helps you deploy vLLM on k8s and automate the deployment of vLLM Kubernetes applications. With Helm, you can deploy the same framework architecture with different configurations to multiple namespaces by overriding variable values.

This guide walks you through the process of deploying vLLM with Helm, including the necessary prerequisites, the steps for a helm install, and documentation on the architecture and the values file.

Prerequisites
-------------
Before you begin, ensure that you have the following:

- A running Kubernetes cluster
- NVIDIA Kubernetes Device Plugin (``k8s-device-plugin``): This can be found at `https://github.com/NVIDIA/k8s-device-plugin <https://github.com/NVIDIA/k8s-device-plugin>`__
- Available GPU resources in your cluster
- An S3 bucket containing the model to be deployed

Installing the chart
--------------------

To install the chart with the release name ``test-vllm``:

.. code-block:: console

   helm upgrade --install --create-namespace --namespace=ns-vllm test-vllm . -f values.yaml --set secrets.s3endpoint=$ACCESS_POINT --set secrets.s3bucketname=$BUCKET --set secrets.s3accesskeyid=$ACCESS_KEY --set secrets.s3accesskey=$SECRET_KEY
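The same chart can be installed into several namespaces with different overrides, each override becoming one ``--set key=value`` flag as in the install command above. As an illustrative sketch (the helper name and defaults are assumptions, not part of the chart), assembling such a command line programmatically:

```python
import shlex

def helm_install_cmd(release: str, namespace: str, overrides: dict) -> str:
    """Assemble a `helm upgrade --install` command line from value overrides.

    Illustrative only: mirrors the install command shown above, turning each
    overrides entry into a --set key=value flag.
    """
    cmd = [
        "helm", "upgrade", "--install", "--create-namespace",
        f"--namespace={namespace}", release, ".", "-f", "values.yaml",
    ]
    for key, value in overrides.items():
        cmd += ["--set", f"{key}={value}"]
    return shlex.join(cmd)  # shell-quote and join the argument list

cmd = helm_install_cmd(
    "test-vllm", "ns-vllm",
    {"secrets.s3bucketname": "testbucket", "replicaCount": 2},
)
```

Deploying to a second namespace is then just a second call with a different ``namespace`` and ``overrides`` dict.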
Uninstalling the Chart
----------------------

To uninstall the ``test-vllm`` deployment:

.. code-block:: console

   helm uninstall test-vllm --namespace=ns-vllm

The command removes all the Kubernetes components associated with the chart, **including persistent volumes**, and deletes the release.

Architecture
------------

.. image:: architecture_helm_deployment.png

Values
------

.. list-table:: Values
   :widths: 25 25 25 25
   :header-rows: 1

   * - Key
     - Type
     - Default
     - Description
   * - autoscaling
     - object
     - {"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80}
     - Autoscaling configuration
   * - autoscaling.enabled
     - bool
     - false
     - Enable autoscaling
   * - autoscaling.maxReplicas
     - int
     - 100
     - Maximum replicas
   * - autoscaling.minReplicas
     - int
     - 1
     - Minimum replicas
   * - autoscaling.targetCPUUtilizationPercentage
     - int
     - 80
     - Target CPU utilization for autoscaling
   * - configs
     - object
     - {}
     - ConfigMap
   * - containerPort
     - int
     - 8000
     - Container port
   * - customObjects
     - list
     - []
     - Custom objects configuration
   * - deploymentStrategy
     - object
     - {}
     - Deployment strategy configuration
   * - externalConfigs
     - list
     - []
     - External configuration
   * - extraContainers
     - list
     - []
     - Additional containers configuration
   * - extraInit
     - object
     - {"pvcStorage":"1Gi","s3modelpath":"relative_s3_model_path/opt-125m", "awsEc2MetadataDisabled": true}
     - Additional configuration for the init container
   * - extraInit.pvcStorage
     - string
     - "1Gi"
     - Storage size of the PVC used to hold the model downloaded from S3
   * - extraInit.s3modelpath
     - string
     - "relative_s3_model_path/opt-125m"
     - Path on the S3 bucket that hosts the model weights and config files
   * - extraInit.awsEc2MetadataDisabled
     - boolean
     - true
     - Disables the use of the Amazon EC2 instance metadata service
   * - extraPorts
     - list
     - []
     - Additional ports configuration
   * - gpuModels
     - list
     - ["TYPE_GPU_USED"]
     - Type of GPU used
   * - image
     - object
     - {"command":["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"],"repository":"vllm/vllm-openai","tag":"latest"}
     - Image configuration
   * - image.command
     - list
     - ["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"]
     - Container launch command
   * - image.repository
     - string
     - "vllm/vllm-openai"
     - Image repository
   * - image.tag
     - string
     - "latest"
     - Image tag
   * - livenessProbe
     - object
     - {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":15,"periodSeconds":10}
     - Liveness probe configuration
   * - livenessProbe.failureThreshold
     - int
     - 3
     - Number of consecutive probe failures after which Kubernetes considers the container not alive
   * - livenessProbe.httpGet
     - object
     - {"path":"/health","port":8000}
     - Configuration of the kubelet HTTP request to the server
   * - livenessProbe.httpGet.path
     - string
     - "/health"
     - Path to access on the HTTP server
   * - livenessProbe.httpGet.port
     - int
     - 8000
     - Name or number of the container port the server listens on
   * - livenessProbe.initialDelaySeconds
     - int
     - 15
     - Number of seconds after the container has started before the liveness probe is initiated
   * - livenessProbe.periodSeconds
     - int
     - 10
     - How often (in seconds) to perform the liveness probe
   * - maxUnavailablePodDisruptionBudget
     - string
     - ""
     - Disruption budget configuration
   * - readinessProbe
     - object
     - {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":5,"periodSeconds":5}
     - Readiness probe configuration
   * - readinessProbe.failureThreshold
     - int
     - 3
     - Number of consecutive probe failures after which Kubernetes considers the container not ready
   * - readinessProbe.httpGet
     - object
     - {"path":"/health","port":8000}
     - Configuration of the kubelet HTTP request to the server
   * - readinessProbe.httpGet.path
     - string
     - "/health"
     - Path to access on the HTTP server
   * - readinessProbe.httpGet.port
     - int
     - 8000
     - Name or number of the container port the server listens on
   * - readinessProbe.initialDelaySeconds
     - int
     - 5
     - Number of seconds after the container has started before the readiness probe is initiated
   * - readinessProbe.periodSeconds
     - int
     - 5
     - How often (in seconds) to perform the readiness probe
   * - replicaCount
     - int
     - 1
     - Number of replicas
   * - resources
     - object
     - {"limits":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1},"requests":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1}}
     - Resource configuration
   * - resources.limits."nvidia.com/gpu"
     - int
     - 1
     - Number of GPUs used
   * - resources.limits.cpu
     - int
     - 4
     - Number of CPUs
   * - resources.limits.memory
     - string
     - "16Gi"
     - CPU memory configuration
   * - resources.requests."nvidia.com/gpu"
     - int
     - 1
     - Number of GPUs used
   * - resources.requests.cpu
     - int
     - 4
     - Number of CPUs
   * - resources.requests.memory
     - string
     - "16Gi"
     - CPU memory configuration
   * - secrets
     - object
     - {}
     - Secrets configuration
   * - serviceName
     - string
     -
     - Service name
   * - servicePort
     - int
     - 80
     - Service port
   * - labels.environment
     - string
     - test
     - Environment name
   * - labels.release
     - string
     - test
     - Release name
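Putting the table together, an override file passed to helm with ``-f`` might look as follows. This is a hypothetical ``custom-values.yaml``, not shipped with the chart; every value mirrors a default from the table above except ``replicaCount``, which is changed as an example:

```yaml
# Hypothetical custom-values.yaml; values mirror the defaults in the table above.
image:
  repository: "vllm/vllm-openai"
  tag: "latest"
  command: ["vllm", "serve", "/data/", "--served-model-name", "opt-125m",
            "--host", "0.0.0.0", "--port", "8000"]

replicaCount: 2          # example override; the chart default is 1

resources:
  requests:
    cpu: 4
    memory: "16Gi"
    nvidia.com/gpu: 1
  limits:
    cpu: 4
    memory: "16Gi"
    nvidia.com/gpu: 1

extraInit:
  pvcStorage: "1Gi"
  s3modelpath: "relative_s3_model_path/opt-125m"
  awsEc2MetadataDisabled: true
```

Keys omitted from such a file keep their chart defaults, and individual entries can still be overridden at install time with ``--set``.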

examples/chart-helm/.helmignore

Lines changed: 6 additions & 0 deletions

*.png
.git/
ct.yaml
lintconf.yaml
values.schema.json
/workflows

examples/chart-helm/Chart.yaml

Lines changed: 21 additions & 0 deletions

apiVersion: v2
name: chart-vllm
description: Chart vllm

# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application

# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.0.1

maintainers:
  - name: mfournioux

examples/chart-helm/ct.yaml

Lines changed: 3 additions & 0 deletions

chart-dirs:
  - charts
validate-maintainers: false
