# Serving Your Private Code Llama-70B with API, Chat, and VSCode Access

[Code Llama](https://github.com/facebookresearch/codellama) is a code-specialized version of Llama 2, created by further training Llama 2 on code-specific datasets and sampling more data from those datasets for longer. On Jan 29, 2024, Meta released Code Llama 70B, the largest and best-performing model in the Code Llama family.

The following are demos of Code Llama 70B hosted with SkyPilot Serve (aka SkyServe); see the sections below for setup details:

##### Connect to hosted Code Llama with Tabby as a coding assistant in VScode
<img src="https://imgur.com/fguAmP0.gif" width="60%" />

##### Connect to hosted Code Llama with FastChat for chatting
<img src="https://imgur.com/Dor1MoE.gif" width="60%" />

## References
* [Llama-2 Example](../../llm/llama-2/)
* [Code Llama release](https://ai.meta.com/blog/code-llama-large-language-model-coding/)
* [Code Llama paper](https://arxiv.org/abs/2308.12950)

## Why use SkyPilot instead of commercial hosted solutions?

* Get the best GPU availability by utilizing multiple resource pools across multiple regions and clouds.
* Pay the absolute minimum — SkyPilot picks the cheapest resources across regions and clouds. No managed solution markups.
* Scale up to multiple replicas across different locations and accelerators, all served with a single endpoint.
* Everything stays in your cloud account (your VMs & buckets).
* Completely private - no one else sees your chat history.


## Running your own Code Llama with SkyPilot

After [installing SkyPilot](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html), run your own Code Llama on vLLM with a single command:

1. Start serving Code Llama 70B on a single instance, using any available GPU from the list in [endpoint.yaml](endpoint.yaml), behind a vLLM-powered OpenAI-compatible endpoint:
```console
sky launch -c code-llama -s endpoint.yaml

----------------------------------------------------------------------------------------------------------
 CLOUD   INSTANCE                     vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN
----------------------------------------------------------------------------------------------------------
 Azure   Standard_NC48ads_A100_v4     48      440       A100-80GB:2    eastus          7.35          ✔
 GCP     g2-standard-96               96      384       L4:8           us-east4-a      7.98
 GCP     a2-ultragpu-2g               24      340       A100-80GB:2    us-central1-a   10.06
 Azure   Standard_NC96ads_A100_v4     96      880       A100-80GB:4    eastus          14.69
 GCP     a2-highgpu-4g                48      340       A100:4         us-central1-a   14.69
 AWS     g5.48xlarge                  192     768       A10G:8         us-east-1       16.29
 GCP     a2-ultragpu-4g               48      680       A100-80GB:4    us-central1-a   20.11
 Azure   Standard_ND96asr_v4          96      900       A100:8         eastus          27.20
 GCP     a2-highgpu-8g                96      680       A100:8         us-central1-a   29.39
 AWS     p4d.24xlarge                 96      1152      A100:8         us-east-1       32.77
 Azure   Standard_ND96amsr_A100_v4    96      1924      A100-80GB:8    eastus          32.77
 GCP     a2-ultragpu-8g               96      1360      A100-80GB:8    us-central1-a   40.22
 AWS     p4de.24xlarge                96      1152      A100-80GB:8    us-east-1       40.97
----------------------------------------------------------------------------------------------------------

Launching a cluster 'code-llama'. Proceed? [Y/n]:
```
2. Send a request to the endpoint for code completion:
```bash
IP=$(sky status --ip code-llama)

curl -L http://$IP:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "codellama/CodeLlama-70b-Instruct-hf",
    "prompt": "def quick_sort(a: List[int]):",
    "max_tokens": 512
  }' | jq -r '.choices[0].text'
```

This returns a completion similar to the following:
```python
    if len(a) <= 1:
        return a
    pivot = a.pop(len(a)//2)
    b = []
    c = []
    for i in a:
        if i > pivot:
            b.append(i)
        else:
            c.append(i)
    b = quick_sort(b)
    c = quick_sort(c)
    res = []
    res.extend(c)
    res.append(pivot)
    res.extend(b)
    return res
```
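
The same request can also be sent from Python. Below is a minimal sketch using the `openai` client package; it assumes `pip install openai`, that the (hypothetical) environment variable `CODE_LLAMA_IP` holds the output of `sky status --ip code-llama`, and that the server does not require a real API key:
```python
# Minimal sketch: query the vLLM OpenAI-compatible endpoint from Python.
# Assumes CODE_LLAMA_IP is set to the IP printed by `sky status --ip code-llama`.
import os

from openai import OpenAI

client = OpenAI(
    base_url=f"http://{os.environ['CODE_LLAMA_IP']}:8000/v1",
    api_key="EMPTY",  # placeholder; the server in this example does not check the key
)

completion = client.completions.create(
    model="codellama/CodeLlama-70b-Instruct-hf",
    prompt="def quick_sort(a: List[int]):",
    max_tokens=512,
)
print(completion.choices[0].text)
```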

## Scale up the service with SkyServe

1. With [SkyServe](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html), a serving library built on top of SkyPilot, scaling up the Code Llama service is as simple as running:
```bash
sky serve up -n code-llama ./endpoint.yaml
```
This will start the service with multiple replicas on the cheapest available locations and accelerators. SkyServe will automatically manage the replicas, monitor their health, autoscale based on load, and restart them when needed.

A single endpoint will be returned, and any request sent to it will be routed to one of the ready replicas.

2. To check the status of the service, run:
```bash
sky serve status code-llama
```
After a while, you will see the following output:
```console
Services
NAME        VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
code-llama  1        -       READY   2/2       3.85.107.228:30002

Service Replicas
SERVICE_NAME  ID  VERSION  IP  LAUNCHED    RESOURCES                   STATUS  REGION
code-llama    1   1        -   2 mins ago  1x Azure({'A100-80GB': 2})  READY   eastus
code-llama    2   1        -   2 mins ago  1x GCP({'L4': 8})           READY   us-east4-a
```
As shown, the service is now backed by 2 replicas, one on Azure and one on GCP, and the accelerator
type for each replica is chosen to be **the cheapest available one** on that cloud. This maximizes the
availability of the service while minimizing the cost.

3. To access the model, send a request to the service endpoint with the same `curl` command as before:
```bash
ENDPOINT=$(sky serve status --endpoint code-llama)

curl -L http://$ENDPOINT/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "codellama/CodeLlama-70b-Instruct-hf",
    "prompt": "def quick_sort(a: List[int]):",
    "max_tokens": 512
  }' | jq -r '.choices[0].text'
```

## **Optional:** Accessing Code Llama with Chat API

We can also access the Code Llama service with the OpenAI Chat API.
```bash
ENDPOINT=$(sky serve status --endpoint code-llama)

curl -L http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "codellama/CodeLlama-70b-Instruct-hf",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful and honest code assistant expert in Python."
      },
      {
        "role": "user",
        "content": "Show me the python code for quick sorting a list of integers."
      }
    ],
    "max_tokens": 512
  }' | jq -r '.choices[0].message.content'
```

You should see something similar to the following:
````````
```python
def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)

# Example usage:
numbers = [10, 2, 44, 15, 30, 11, 50]
sorted_numbers = quicksort(numbers)
print(sorted_numbers)
```

This code defines a function `quicksort` that takes a list of integers as input. It divides the list into three parts based on the pivot element, which is the middle element of the list. It then recursively sorts the left and right partitions and combines them with the middle partition.
````````

Alternatively, we can access the model from Python using the OpenAI API client (see [complete.py](complete.py)):
```bash
python complete.py
```
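
For reference, here is a minimal sketch of the kind of client code such a script can contain (the actual [complete.py](complete.py) may differ). It assumes `pip install openai` and that the (hypothetical) environment variable `CODE_LLAMA_ENDPOINT` holds the output of `sky serve status --endpoint code-llama`:
```python
# Minimal sketch of calling the Chat Completions API on the SkyServe endpoint.
# Assumes CODE_LLAMA_ENDPOINT is set to `sky serve status --endpoint code-llama`.
import os

from openai import OpenAI

client = OpenAI(
    base_url=f"http://{os.environ['CODE_LLAMA_ENDPOINT']}/v1",
    api_key="EMPTY",  # placeholder; the endpoint in this example does not check the key
)

response = client.chat.completions.create(
    model="codellama/CodeLlama-70b-Instruct-hf",
    messages=[
        {"role": "system", "content": "You are a helpful and honest code assistant expert in Python."},
        {"role": "user", "content": "Show me the python code for quick sorting a list of integers."},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```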

## **Optional:** Accessing Code Llama with Chat GUI

It is also possible to access the Code Llama service with a GUI using [FastChat](https://github.com/lm-sys/FastChat). Please check the [demo](#connect-to-hosted-code-llama-with-fastchat-for-chatting).

1. Start the chat web UI:
```bash
sky launch -c code-llama-gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint code-llama)
```

2. Then, we can access the GUI at the returned Gradio link:
```
| INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
```

Note that you may get better results by using a higher temperature and top_p value.
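
If you call the API directly instead of using the GUI, the same sampling parameters can be passed in the request. A minimal sketch with illustrative (not tuned) values, under the same assumptions as the earlier Python sketches:
```python
# Sketch: set sampling parameters explicitly when calling the API directly.
# Assumes CODE_LLAMA_ENDPOINT is set to `sky serve status --endpoint code-llama`;
# the temperature/top_p values below are examples, not tuned recommendations.
import os

from openai import OpenAI

client = OpenAI(base_url=f"http://{os.environ['CODE_LLAMA_ENDPOINT']}/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="codellama/CodeLlama-70b-Instruct-hf",
    messages=[{"role": "user", "content": "Write a Python function that merges two sorted lists."}],
    max_tokens=512,
    temperature=0.9,
    top_p=0.95,
)
print(response.choices[0].message.content)
```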


## **Optional:** Using Code Llama as a Coding Assistant in VScode

[Tabby](https://tabby.tabbyml.com/) is an open-source, self-hosted AI coding assistant. It allows you to connect
to your own AI models and use them as a coding assistant in VScode. Please check the [demo](#connect-to-hosted-code-llama-with-tabby-as-a-coding-assistant-in-vscode) at the top.

To start a Tabby server that connects to the Code Llama service, run:
```bash
sky launch -c tabby ./tabby.yaml --env ENDPOINT=$(sky serve status --endpoint code-llama)
```

To get the endpoint of the Tabby server, run:
```bash
IP=$(sky status --ip tabby)
echo Endpoint: http://$IP:8080
```

Then, you can connect to the Tabby server from VScode by installing the [Tabby extension](https://marketplace.visualstudio.com/items?itemName=tabby-ai.tabby-vscode).

> Note that Code Llama 70B does not have the full infilling functionality [[1](https://huggingface.co/codellama/CodeLlama-70b-Instruct-hf)], so the performance of Tabby with Code Llama may be limited.
>
> To get infilling functionality, you can use the smaller Code Llama models, e.g., Code Llama [7B](https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf) and [13B](https://huggingface.co/codellama/CodeLlama-13b-Instruct-hf), and replace `prompt_template` with `"<|fim▁begin|>{prefix}<|fim▁hole|>{suffix}<|fim▁end|>"` in the [yaml](./tabby.yaml) or the command above.
>
> For better performance, we recommend using Tabby with the recommended models in the [Tabby documentation](https://tabby.tabbyml.com/docs/models/) and our [Tabby example](../tabby).