
Commit b25471c

Michaelvll, romilbhardwaj, and MaoZiming authored
[LLM] Add code llama example (#3050)
* Add code llama example
* Add update and python API
* use endpoint and fix ux for serve update
* Add gui
* adopt changes from comments
* fix gui
* fix gui
* use API server directly for GUI
* Move web GUI gif to the top
* new line
* Credit for fastchat
* use instruct model instead
* Update llm/codellama/README.md Co-authored-by: Romil Bhardwaj <[email protected]>
* Update llm/codellama/README.md Co-authored-by: Romil Bhardwaj <[email protected]>
* Update llm/codellama/README.md Co-authored-by: Ziming Mao <[email protected]>
* Update llm/codellama/README.md Co-authored-by: Ziming Mao <[email protected]>
* Update llm/codellama/complete.py Co-authored-by: Ziming Mao <[email protected]>
* Update llm/codellama/README.md Co-authored-by: Ziming Mao <[email protected]>
* Update llm/codellama/README.md Co-authored-by: Romil Bhardwaj <[email protected]>
* Update llm/codellama/README.md Co-authored-by: Romil Bhardwaj <[email protected]>
* Update llm/codellama/README.md Co-authored-by: Romil Bhardwaj <[email protected]>
* Update llm/codellama/gui.yaml Co-authored-by: Romil Bhardwaj <[email protected]>
* Add 70B and SkyServe
* Add comments
* Fix issues in the instruction
* dependency fixes for GUI
* simplify replica config
* fix indents
* fix
* use 2 replicas
* add news
* recycle llama 2 chatbot
* Add tabby example
* Image size
* smaller size
* Fix readme
* Smaller title
* shorten title
* shorten
* fix
* paras
* remove local
* rename
* typo
* title
* Increase initial delay
* change to skypilot serve
* title
* new title

---------

Co-authored-by: Romil Bhardwaj <[email protected]>
Co-authored-by: Ziming Mao <[email protected]>
1 parent 54e5bb0 commit b25471c

File tree: 9 files changed (+372, -4 lines)


README.md (+1, -1)

@@ -27,13 +27,13 @@

 ----
 :fire: *News* :fire:
+- [Jan, 2024]: Example: Serving [Code Llama 70B](https://ai.meta.com/blog/code-llama-large-language-model-coding/) with vLLM and SkyServe: [**example**](./llm/codellama/)
 - [Dec, 2023] Example: Using [LoRAX](https://github.com/predibase/lorax) to serve 1000s of finetuned LLMs on a single instance in the cloud: [**example**](./llm/lorax/)
 - [Dec, 2023] [**Mixtral 8x7B**](https://mistral.ai/news/mixtral-of-experts/), a high quality sparse mixture-of-experts model, was released by Mistral AI! Deploy via SkyPilot on any cloud: [**example**](./llm/mixtral/).
 - [Nov, 2023] Example: Using [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) to finetune Mistral 7B on the cloud (on-demand and spot): [**example**](./llm/axolotl/)
 - [Sep, 2023] [**Mistral 7B**](https://mistral.ai/news/announcing-mistral-7b/), a high-quality open LLM, was released! Deploy via SkyPilot on any cloud: [**Mistral docs**](https://docs.mistral.ai/self-deployment/skypilot)
 - [Sep, 2023] Case study: [**Covariant**](https://covariant.ai/) transformed AI development on the cloud using SkyPilot, delivering models 4x faster cost-effectively: [**read the case study**](https://blog.skypilot.co/covariant/)
 - [Aug, 2023] Cookbook: Finetuning Llama 2 in your own cloud environment, privately: [**example**](./llm/vicuna-llama-2/), [**blog post**](https://blog.skypilot.co/finetuning-llama2-operational-guide/)
-- [July, 2023] Self-Hosted **Llama-2 Chatbot** on Any Cloud: [**example**](./llm/llama-2/)
 - [June, 2023] Serving LLM 24x Faster On the Cloud [**with vLLM**](https://vllm.ai/) and SkyPilot: [**example**](./llm/vllm/), [**blog post**](https://blog.skypilot.co/serving-llm-24x-faster-on-the-cloud-with-vllm-and-skypilot/)
 - [April, 2023] [SkyPilot YAMLs](./llm/vicuna/) for finetuning & serving the [Vicuna LLM](https://lmsys.org/blog/2023-03-30-vicuna/) with a single command!
 ----

examples/serve/vllm.yaml (-1)

@@ -38,7 +38,6 @@ run: |

   python3 -m fastchat.serve.controller --host 0.0.0.0 --port ${CONTROLLER_PORT} > ~/controller.log 2>&1 &

-  cd FastChat
   python3 -m fastchat.serve.vllm_worker \
     --model-path lmsys/vicuna-7b-v1.5 \
     --controller-address http://${WORKER_IP}:${CONTROLLER_PORT} \

llm/codellama/README.md (+221)

@@ -0,0 +1,221 @@
# Serving Your Private Code Llama-70B with API, Chat, and VSCode Access

[Code Llama](https://github.com/facebookresearch/codellama) is a code-specialized version of Llama 2, created by further training Llama 2 on its code-specific datasets and sampling more data from those datasets for longer. On Jan 29, 2024, Meta released Code Llama 70B, the largest and best-performing model in the Code Llama family.

The following are demos of Code Llama 70B hosted by SkyPilot Serve (aka SkyServe); see later sections for details on the setup:

##### Connect to hosted Code Llama with Tabby as a coding assistant in VSCode
<img src="https://imgur.com/fguAmP0.gif" width="60%" />

##### Connect to hosted Code Llama with FastChat for chatting
<img src="https://imgur.com/Dor1MoE.gif" width="60%" />

## References
* [Llama-2 Example](../../llm/llama-2/)
* [Code Llama release](https://ai.meta.com/blog/code-llama-large-language-model-coding/)
* [Code Llama paper](https://arxiv.org/abs/2308.12950)

## Why use SkyPilot instead of commercial hosted solutions?

* Get the best GPU availability by utilizing multiple resource pools across multiple regions and clouds.
* Pay the absolute minimum: SkyPilot picks the cheapest resources across regions and clouds. No markups from managed solutions.
* Scale up to multiple replicas across different locations and accelerators, all served behind a single endpoint.
* Everything stays in your cloud account (your VMs & buckets).
* Completely private: no one else sees your chat history.

## Running your own Code Llama with SkyPilot

After [installing SkyPilot](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html), run your own Code Llama on vLLM with SkyPilot with a single command:

1. Start serving Code Llama 70B on a single instance, using any available GPU listed in [endpoint.yaml](endpoint.yaml), with a vLLM-powered OpenAI-compatible endpoint:
```console
sky launch -c code-llama -s endpoint.yaml

------------------------------------------------------------------------------------------------------------
 CLOUD   INSTANCE                     vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN
------------------------------------------------------------------------------------------------------------
 Azure   Standard_NC48ads_A100_v4     48      440       A100-80GB:2    eastus          7.35          ✔
 GCP     g2-standard-96               96      384       L4:8           us-east4-a      7.98
 GCP     a2-ultragpu-2g               24      340       A100-80GB:2    us-central1-a   10.06
 Azure   Standard_NC96ads_A100_v4     96      880       A100-80GB:4    eastus          14.69
 GCP     a2-highgpu-4g                48      340       A100:4         us-central1-a   14.69
 AWS     g5.48xlarge                  192     768       A10G:8         us-east-1       16.29
 GCP     a2-ultragpu-4g               48      680       A100-80GB:4    us-central1-a   20.11
 Azure   Standard_ND96asr_v4          96      900       A100:8         eastus          27.20
 GCP     a2-highgpu-8g                96      680       A100:8         us-central1-a   29.39
 AWS     p4d.24xlarge                 96      1152      A100:8         us-east-1       32.77
 Azure   Standard_ND96amsr_A100_v4    96      1924      A100-80GB:8    eastus          32.77
 GCP     a2-ultragpu-8g               96      1360      A100-80GB:8    us-central1-a   40.22
 AWS     p4de.24xlarge                96      1152      A100-80GB:8    us-east-1       40.97
------------------------------------------------------------------------------------------------------------

Launching a cluster 'code-llama'. Proceed? [Y/n]:
```

2. Send a request to the endpoint for code completion:
```bash
IP=$(sky status --ip code-llama)

curl -L http://$IP:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "codellama/CodeLlama-70b-Instruct-hf",
      "prompt": "def quick_sort(a: List[int]):",
      "max_tokens": 512
    }' | jq -r '.choices[0].text'
```

This returns the following completion:
```python
    if len(a) <= 1:
        return a
    pivot = a.pop(len(a)//2)
    b = []
    c = []
    for i in a:
        if i > pivot:
            b.append(i)
        else:
            c.append(i)
    b = quick_sort(b)
    c = quick_sort(c)
    res = []
    res.extend(c)
    res.append(pivot)
    res.extend(b)
    return res
```

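If you prefer Python over `curl`, the same completions endpoint can be queried with the `openai` client. A minimal sketch, assuming the `code-llama` cluster above is running and `openai>=1` is installed locally (the IP lookup simply shells out to the same `sky status --ip` command):
```python
# Minimal sketch: query the vLLM OpenAI-compatible /v1/completions endpoint from Python.
import subprocess

import openai

# Same lookup as `IP=$(sky status --ip code-llama)` above.
ip = subprocess.run(['sky', 'status', '--ip', 'code-llama'],
                    capture_output=True, text=True, check=True).stdout.strip()

client = openai.OpenAI(base_url=f'http://{ip}:8000/v1', api_key='EMPTY')
completion = client.completions.create(
    model='codellama/CodeLlama-70b-Instruct-hf',
    prompt='def quick_sort(a: List[int]):',
    max_tokens=512,
)
print(completion.choices[0].text)
```
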
## Scale up the service with SkyServe

1. With [SkyServe](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html), a serving library built on top of SkyPilot, scaling up the Code Llama service is as simple as running:
```bash
sky serve up -n code-llama ./endpoint.yaml
```
This will start the service with multiple replicas on the cheapest available locations and accelerators. SkyServe will automatically manage the replicas, monitor their health, autoscale based on load, and restart them when needed.

A single endpoint is returned, and any request sent to the endpoint is routed to one of the ready replicas.

2. To check the status of the service, run:
```bash
sky serve status code-llama
```
After a while, you will see the following output:
```console
Services
NAME        VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
code-llama  1        -       READY   2/2       3.85.107.228:30002

Service Replicas
SERVICE_NAME  ID  VERSION  IP  LAUNCHED    RESOURCES                   STATUS  REGION
code-llama    1   1        -   2 mins ago  1x Azure({'A100-80GB': 2})  READY   eastus
code-llama    2   1        -   2 mins ago  1x GCP({'L4': 8})           READY   us-east4-a
```
As shown, the service is now backed by 2 replicas, one on Azure and one on GCP, and the accelerator type for each replica is chosen to be **the cheapest available one** on those clouds. This maximizes the availability of the service while minimizing the cost.

3. To access the model, we use the same curl command to send a request to the endpoint:
```bash
ENDPOINT=$(sky serve status --endpoint code-llama)

curl -L http://$ENDPOINT/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "codellama/CodeLlama-70b-Instruct-hf",
      "prompt": "def quick_sort(a: List[int]):",
      "max_tokens": 512
    }' | jq -r '.choices[0].text'
```

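The same service information is also available programmatically. A minimal sketch using the SkyPilot Python API (the same `sky.serve.status` / `serve.get_endpoint` calls used by [complete.py](complete.py)):
```python
# Minimal sketch: inspect the SkyServe service from Python instead of the CLI.
import sky
from sky import serve

# Returns a list of service records for the given service name.
service_records = sky.serve.status('code-llama')
for record in service_records:
    print('Service record:', record)
    print('Endpoint:', serve.get_endpoint(record))
```
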
## **Optional:** Accessing Code Llama with Chat API

We can also access the Code Llama service with the OpenAI Chat API.
```bash
ENDPOINT=$(sky serve status --endpoint code-llama)

curl -L http://$ENDPOINT/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "codellama/CodeLlama-70b-Instruct-hf",
      "messages": [
        {
          "role": "system",
          "content": "You are a helpful and honest code assistant expert in Python."
        },
        {
          "role": "user",
          "content": "Show me the python code for quick sorting a list of integers."
        }
      ],
      "max_tokens": 512
    }' | jq -r '.choices[0].message.content'
```

You will see something similar to the following:
````````
```python
def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)

# Example usage:
numbers = [10, 2, 44, 15, 30, 11, 50]
sorted_numbers = quicksort(numbers)
print(sorted_numbers)
```

This code defines a function `quicksort` that takes a list of integers as input. It divides the list into three parts based on the pivot element, which is the middle element of the list. It then recursively sorts the left and right partitions and combines them with the middle partition.
````````

Alternatively, we can access the model from Python with OpenAI's API (see [complete.py](complete.py)):
```bash
python complete.py
```

## **Optional:** Accessing Code Llama with Chat GUI

It is also possible to access the Code Llama service with a GUI using [FastChat](https://github.com/lm-sys/FastChat). Please check the [demo](#connect-to-hosted-code-llama-with-fastchat-for-chatting).

1. Start the chat web UI:
```bash
sky launch -c code-llama-gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint code-llama)
```

2. Then, we can access the GUI at the returned gradio link:
```
| INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
```

Note that you may get better results by using a higher temperature and top_p value.

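The same sampling parameters can also be passed when calling the chat endpoint directly. A minimal sketch (the endpoint retrieval mirrors [complete.py](complete.py); the `temperature`/`top_p` values here are only illustrative):
```python
# Sketch: set sampling parameters when querying the chat endpoint directly.
import openai

import sky
from sky import serve

service_records = sky.serve.status('code-llama')
endpoint = serve.get_endpoint(service_records[0])

client = openai.OpenAI(base_url=f'http://{endpoint}/v1', api_key='EMPTY')
resp = client.chat.completions.create(
    model='codellama/CodeLlama-70b-Instruct-hf',
    messages=[{
        'role': 'user',
        'content': 'Write a Python function that reverses a string.'
    }],
    temperature=0.9,  # Illustrative values; tune to taste.
    top_p=0.95,
    max_tokens=256,
)
print(resp.choices[0].message.content)
```
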
## **Optional:** Using Code Llama as a Coding Assistant in VSCode

[Tabby](https://tabby.tabbyml.com/) is an open-source, self-hosted AI coding assistant. It allows you to connect
to your own AI models and use them as a coding assistant in VSCode. Please check the [demo](#connect-to-hosted-code-llama-with-tabby-as-a-coding-assistant-in-vscode) at the top.

To start a Tabby server that connects to the Code Llama service, run:
```bash
sky launch -c tabby ./tabby.yaml --env ENDPOINT=$(sky serve status --endpoint code-llama)
```

To get the endpoint for the Tabby server, run:
```bash
IP=$(sky status --ip tabby)
echo Endpoint: http://$IP:8080
```

Then, you can connect to the Tabby server from VSCode by installing the [Tabby extension](https://marketplace.visualstudio.com/items?itemName=tabby-ai.tabby-vscode).

> Note that Code Llama 70B does not have the full infilling functionality [[1](https://huggingface.co/codellama/CodeLlama-70b-Instruct-hf)], so the performance of Tabby with Code Llama may be limited.
>
> To get infilling functionality, you can use the smaller Code Llama models, e.g., Code Llama [7B](https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf) and [13B](https://huggingface.co/codellama/CodeLlama-13B-Instruct-hf), and replace `prompt_template` with `"<|fim▁begin|>{prefix}<|fim▁hole|>{suffix}<|fim▁end|>"` in the [yaml](./tabby.yaml) or the command above.
>
> For better performance, we recommend using Tabby with the recommended models in the [Tabby documentation](https://tabby.tabbyml.com/docs/models/) and our [Tabby example](../tabby).

llm/codellama/complete.py (+28)

@@ -0,0 +1,28 @@
import openai

import sky
from sky import serve

service_records = sky.serve.status('code-llama')
endpoint = serve.get_endpoint(service_records[0])

print('Using endpoint:', endpoint)

client = openai.OpenAI(
    base_url=f'http://{endpoint}/v1',
    # No API key is required when self-hosted.
    api_key='EMPTY')

chat_completion = client.chat.completions.create(
    model='codellama/CodeLlama-70b-Instruct-hf',
    messages=[{
        'role': 'system',
        'content': 'You are a helpful and honest code assistant expert in Python.'
    }, {
        'role': 'user',
        'content': 'Show me the code for quick sorting a list of integers.'
    }],
    max_tokens=300,
)

print(chat_completion.model_dump())

llm/codellama/endpoint.yaml (+45)

@@ -0,0 +1,45 @@
# An example YAML for serving the Code Llama model from Meta with an OpenAI-compatible API.
# Usage:
#  1. Launch on a single instance: `sky launch -c code-llama ./endpoint.yaml`
#  2. Scale up to multiple replicas with a single endpoint:
#     `sky serve up -n code-llama ./endpoint.yaml`
service:
  readiness_probe:
    path: /v1/completions
    post_data:
      model: codellama/CodeLlama-70b-Instruct-hf
      prompt: "def hello_world():"
      max_tokens: 1
    initial_delay_seconds: 1800
  replicas: 2

resources:
  accelerators: {L4:8, A10g:8, A10:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
  disk_size: 1024
  disk_tier: best
  memory: 32+
  ports: 8000

setup: |
  conda activate codellama
  if [ $? -ne 0 ]; then
    conda create -n codellama python=3.10 -y
    conda activate codellama
  fi

  # We have to manually install Torch; otherwise apex & xformers won't build.
  pip list | grep torch || pip install "torch>=2.0.0"

  pip list | grep vllm || pip install "git+https://github.com/vllm-project/vllm.git"
  pip install git+https://github.com/huggingface/transformers

run: |
  conda activate codellama
  export PATH=$PATH:/sbin
  # Reduce --max-num-seqs to avoid OOM when loading the model on L4:8.
  python -u -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --model codellama/CodeLlama-70b-Instruct-hf \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-num-seqs 64 | tee ~/openai_api_server.log

llm/codellama/gui.yaml (+50)

@@ -0,0 +1,50 @@
# Starts a GUI server that connects to the Code Llama OpenAI API server.
# This works with the endpoint.yaml; please refer to llm/codellama/README.md
# for more details.
# Usage:
#  1. If you have an endpoint started on a cluster (sky launch):
#     `sky launch -c code-llama-gui ./gui.yaml --env ENDPOINT=$(sky status --ip code-llama):8000`
#  2. If you have a SkyPilot Service started (sky serve up) called code-llama:
#     `sky launch -c code-llama-gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint code-llama)`
# After the GUI server is started, you will see a gradio link in the output and
# you can click on it to open the GUI.

envs:
  ENDPOINT: x.x.x.x:3031 # Address of the API server running codellama.

resources:
  cpus: 2

setup: |
  conda activate codellama
  if [ $? -ne 0 ]; then
    conda create -n codellama python=3.10 -y
    conda activate codellama
  fi

  pip install "fschat[model_worker,webui]"
  pip install "openai<1"

run: |
  conda activate codellama
  export PATH=$PATH:/sbin
  WORKER_IP=$(hostname -I | cut -d' ' -f1)
  CONTROLLER_PORT=21001
  WORKER_PORT=21002

  cat <<EOF > ~/model_info.json
  {
    "codellama/CodeLlama-70b-Instruct-hf": {
      "model_name": "codellama/CodeLlama-70b-Instruct-hf",
      "api_base": "http://${ENDPOINT}/v1",
      "api_key": "empty",
      "model_path": "codellama/CodeLlama-70b-Instruct-hf"
    }
  }
  EOF

  python3 -m fastchat.serve.controller --host 0.0.0.0 --port ${CONTROLLER_PORT} > ~/controller.log 2>&1 &

  echo 'Starting gradio server...'
  python -u -m fastchat.serve.gradio_web_server --share \
    --register-openai-compatible-models ~/model_info.json | tee ~/gradio.log

llm/codellama/tabby.yaml (+25)

@@ -0,0 +1,25 @@
# Starts a Tabby server that connects to the Code Llama OpenAI API server.
# This works with the endpoint.yaml; please refer to llm/codellama/README.md
# for more details.
# Usage:
#  1. If you have an endpoint started on a cluster (sky launch):
#     `sky launch -c tabby ./tabby.yaml --env ENDPOINT=$(sky status --ip code-llama):8000`
#  2. If you have a SkyPilot Service started (sky serve up) called code-llama:
#     `sky launch -c tabby ./tabby.yaml --env ENDPOINT=$(sky serve status --endpoint code-llama)`
# After the Tabby server is started, you can add the endpoint (URL:port) to the VSCode
# Tabby extension and start using it.

envs:
  ENDPOINT: x.x.x.x:3031 # Address of the API server running codellama.

resources:
  cpus: 2
  ports: 8080

setup: |
  wget https://github.com/TabbyML/tabby/releases/download/v0.8.0-rc.1/tabby_x86_64-manylinux2014 -O tabby
  chmod +x tabby

run: |
  ./tabby serve --device experimental-http \
    --model "{\"kind\": \"openai\", \"model_name\": \"codellama/CodeLlama-70b-Instruct-hf\", \"api_endpoint\": \"http://$ENDPOINT/v1/completions\", \"prompt_template\": \"{prefix}\"}"
