# Serving Your Private Code Llama-70B with API, Chat, and VSCode Access

[Code Llama](https://github.com/facebookresearch/codellama) is a code-specialized version of Llama 2, created by further training Llama 2 on code-specific datasets and sampling more data from those datasets for longer. On Jan 29, 2024, Meta released Code Llama 70B, the largest and best-performing model in the Code Llama family.

The following are demos of Code Llama 70B hosted with SkyPilot Serve (aka SkyServe); see the sections below for setup details:

##### Connect to hosted Code Llama with Tabby as a coding assistant in VScode
<img src="https://imgur.com/fguAmP0.gif" width="60%" />

##### Connect to hosted Code Llama with FastChat for chatting
<img src="https://imgur.com/Dor1MoE.gif" width="60%" />

## References
* [Llama-2 Example](../../llm/llama-2/)
* [Code Llama release](https://ai.meta.com/blog/code-llama-large-language-model-coding/)
* [Code Llama paper](https://arxiv.org/abs/2308.12950)

## Why use SkyPilot instead of commercial hosted solutions?

* Get the best GPU availability by utilizing multiple resource pools across multiple regions and clouds.
* Pay the absolute minimum — SkyPilot picks the cheapest resources across regions and clouds. No managed solution markups.
* Scale up to multiple replicas across different locations and accelerators, all served with a single endpoint.
* Everything stays in your cloud account (your VMs & buckets).
* Completely private - no one else sees your chat history.


## Running your own Code Llama with SkyPilot

After [installing SkyPilot](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html), run your own Code Llama on vLLM with a single command:

1. Start serving Code Llama 70B on a single instance, using any available GPU from the list in [endpoint.yaml](endpoint.yaml), behind a vLLM-powered OpenAI-compatible endpoint:
```console
sky launch -c code-llama -s endpoint.yaml

----------------------------------------------------------------------------------------------------------
 CLOUD   INSTANCE                     vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN
----------------------------------------------------------------------------------------------------------
 Azure   Standard_NC48ads_A100_v4     48      440       A100-80GB:2    eastus          7.35          ✔
 GCP     g2-standard-96               96      384       L4:8           us-east4-a      7.98
 GCP     a2-ultragpu-2g               24      340       A100-80GB:2    us-central1-a   10.06
 Azure   Standard_NC96ads_A100_v4     96      880       A100-80GB:4    eastus          14.69
 GCP     a2-highgpu-4g                48      340       A100:4         us-central1-a   14.69
 AWS     g5.48xlarge                  192     768       A10G:8         us-east-1       16.29
 GCP     a2-ultragpu-4g               48      680       A100-80GB:4    us-central1-a   20.11
 Azure   Standard_ND96asr_v4          96      900       A100:8         eastus          27.20
 GCP     a2-highgpu-8g                96      680       A100:8         us-central1-a   29.39
 AWS     p4d.24xlarge                 96      1152      A100:8         us-east-1       32.77
 Azure   Standard_ND96amsr_A100_v4    96      1924      A100-80GB:8    eastus          32.77
 GCP     a2-ultragpu-8g               96      1360      A100-80GB:8    us-central1-a   40.22
 AWS     p4de.24xlarge                96      1152      A100-80GB:8    us-east-1       40.97
----------------------------------------------------------------------------------------------------------

Launching a cluster 'code-llama'. Proceed? [Y/n]:
```
2. Send a request to the endpoint for code completion:
```bash
IP=$(sky status --ip code-llama)

curl -L http://$IP:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "codellama/CodeLlama-70b-Instruct-hf",
    "prompt": "def quick_sort(a: List[int]):",
    "max_tokens": 512
  }' | jq -r '.choices[0].text'
```

This returns a completion similar to the following:
```python
    if len(a) <= 1:
        return a
    pivot = a.pop(len(a)//2)
    b = []
    c = []
    for i in a:
        if i > pivot:
            b.append(i)
        else:
            c.append(i)
    b = quick_sort(b)
    c = quick_sort(c)
    res = []
    res.extend(c)
    res.append(pivot)
    res.extend(b)
    return res
```
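
The same request can also be sent from Python. Below is a minimal sketch using the `openai` client package; it assumes `pip install openai`, that the (hypothetical) environment variable `CODE_LLAMA_IP` holds the output of `sky status --ip code-llama`, and that the server does not require a real API key:
```python
# Minimal sketch: query the vLLM OpenAI-compatible endpoint from Python.
# Assumes CODE_LLAMA_IP is set to the IP printed by `sky status --ip code-llama`.
import os

from openai import OpenAI

client = OpenAI(
    base_url=f"http://{os.environ['CODE_LLAMA_IP']}:8000/v1",
    api_key="EMPTY",  # placeholder; the server in this example does not check the key
)

completion = client.completions.create(
    model="codellama/CodeLlama-70b-Instruct-hf",
    prompt="def quick_sort(a: List[int]):",
    max_tokens=512,
)
print(completion.choices[0].text)
```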

## Scale up the service with SkyServe

1. With [SkyServe](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html), a serving library built on top of SkyPilot, scaling up the Code Llama service is as simple as running:
```bash
sky serve up -n code-llama ./endpoint.yaml
```
This will start the service with multiple replicas on the cheapest available locations and accelerators. SkyServe will automatically manage the replicas, monitor their health, autoscale based on load, and restart them when needed.

A single endpoint will be returned, and any request sent to it will be routed to one of the ready replicas.

2. To check the status of the service, run:
```bash
sky serve status code-llama
```
After a while, you will see the following output:
```console
Services
NAME        VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
code-llama  1        -       READY   2/2       3.85.107.228:30002

Service Replicas
SERVICE_NAME  ID  VERSION  IP  LAUNCHED    RESOURCES                   STATUS  REGION
code-llama    1   1        -   2 mins ago  1x Azure({'A100-80GB': 2})  READY   eastus
code-llama    2   1        -   2 mins ago  1x GCP({'L4': 8})           READY   us-east4-a
```
As shown, the service is now backed by 2 replicas, one on Azure and one on GCP, and the accelerator
type for each replica is chosen to be **the cheapest available one** on that cloud. This maximizes the
availability of the service while minimizing the cost.

3. To access the model, send a request to the service endpoint with the same `curl` command as before:
```bash
ENDPOINT=$(sky serve status --endpoint code-llama)

curl -L http://$ENDPOINT/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "codellama/CodeLlama-70b-Instruct-hf",
    "prompt": "def quick_sort(a: List[int]):",
    "max_tokens": 512
  }' | jq -r '.choices[0].text'
```

## **Optional:** Accessing Code Llama with Chat API

We can also access the Code Llama service with the OpenAI Chat API.
```bash
ENDPOINT=$(sky serve status --endpoint code-llama)

curl -L http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "codellama/CodeLlama-70b-Instruct-hf",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful and honest code assistant expert in Python."
      },
      {
        "role": "user",
        "content": "Show me the python code for quick sorting a list of integers."
      }
    ],
    "max_tokens": 512
  }' | jq -r '.choices[0].message.content'
```

You should see something similar to the following:
````````
```python
def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)

# Example usage:
numbers = [10, 2, 44, 15, 30, 11, 50]
sorted_numbers = quicksort(numbers)
print(sorted_numbers)
```

This code defines a function `quicksort` that takes a list of integers as input. It divides the list into three parts based on the pivot element, which is the middle element of the list. It then recursively sorts the left and right partitions and combines them with the middle partition.
````````

Alternatively, we can access the model from Python using the OpenAI API client (see [complete.py](complete.py)):
```bash
python complete.py
```
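
For reference, here is a minimal sketch of the kind of client code such a script can contain (the actual [complete.py](complete.py) may differ). It assumes `pip install openai` and that the (hypothetical) environment variable `CODE_LLAMA_ENDPOINT` holds the output of `sky serve status --endpoint code-llama`:
```python
# Minimal sketch of calling the Chat Completions API on the SkyServe endpoint.
# Assumes CODE_LLAMA_ENDPOINT is set to `sky serve status --endpoint code-llama`.
import os

from openai import OpenAI

client = OpenAI(
    base_url=f"http://{os.environ['CODE_LLAMA_ENDPOINT']}/v1",
    api_key="EMPTY",  # placeholder; the endpoint in this example does not check the key
)

response = client.chat.completions.create(
    model="codellama/CodeLlama-70b-Instruct-hf",
    messages=[
        {"role": "system", "content": "You are a helpful and honest code assistant expert in Python."},
        {"role": "user", "content": "Show me the python code for quick sorting a list of integers."},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```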

## **Optional:** Accessing Code Llama with Chat GUI

It is also possible to access the Code Llama service with a GUI using [FastChat](https://github.com/lm-sys/FastChat). Please check the [demo](#connect-to-hosted-code-llama-with-fastchat-for-chatting).

1. Start the chat web UI:
```bash
sky launch -c code-llama-gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint code-llama)
```

2. Then, we can access the GUI at the returned Gradio link:
```
| INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
```

Note that you may get better results by using a higher temperature and top_p value.
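
If you call the API directly instead of using the GUI, the same sampling parameters can be passed in the request. A minimal sketch with illustrative (not tuned) values, under the same assumptions as the earlier Python sketches:
```python
# Sketch: set sampling parameters explicitly when calling the API directly.
# Assumes CODE_LLAMA_ENDPOINT is set to `sky serve status --endpoint code-llama`;
# the temperature/top_p values below are examples, not tuned recommendations.
import os

from openai import OpenAI

client = OpenAI(base_url=f"http://{os.environ['CODE_LLAMA_ENDPOINT']}/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="codellama/CodeLlama-70b-Instruct-hf",
    messages=[{"role": "user", "content": "Write a Python function that merges two sorted lists."}],
    max_tokens=512,
    temperature=0.9,
    top_p=0.95,
)
print(response.choices[0].message.content)
```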


## **Optional:** Using Code Llama as a Coding Assistant in VScode

[Tabby](https://tabby.tabbyml.com/) is an open-source, self-hosted AI coding assistant. It allows you to connect
to your own AI models and use them as a coding assistant in VScode. Please check the [demo](#connect-to-hosted-code-llama-with-tabby-as-a-coding-assistant-in-vscode) at the top.

To start a Tabby server that connects to the Code Llama service, run:
```bash
sky launch -c tabby ./tabby.yaml --env ENDPOINT=$(sky serve status --endpoint code-llama)
```

To get the endpoint of the Tabby server, run:
```bash
IP=$(sky status --ip tabby)
echo Endpoint: http://$IP:8080
```

Then, you can connect to the Tabby server from VScode by installing the [Tabby extension](https://marketplace.visualstudio.com/items?itemName=tabby-ai.tabby-vscode).

> Note that Code Llama 70B does not have the full infilling functionality [[1](https://huggingface.co/codellama/CodeLlama-70b-Instruct-hf)], so the performance of Tabby with Code Llama may be limited.
>
> To get infilling functionality, you can use the smaller Code Llama models, e.g., Code Llama [7B](https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf) and [13B](https://huggingface.co/codellama/CodeLlama-13b-Instruct-hf), and replace `prompt_template` with `"<|fim▁begin|>{prefix}<|fim▁hole|>{suffix}<|fim▁end|>"` in the [yaml](./tabby.yaml) or the command above.
>
> For better performance, we recommend using Tabby with the recommended models in the [Tabby documentation](https://tabby.tabbyml.com/docs/models/) and our [Tabby example](../tabby).