
Commit 5a2f1b8

cblmemo and Michaelvll authored
[Serve] Proxy w/ retry (#3395)
[Serve] Proxy w/ retry (#3395)

* init
* support streaming
* max retry num
* upd comments
* remove -L in documentations
* streaming smoke test. TODO: debug and make sure it works
* Apply suggestions from code review (Co-authored-by: Zhanghao Wu <[email protected]>)
* comments and expose exceptions in smoke test
* upd smoke test and passed
* timeout
* yield error
* remove -L
* apply suggestions from code review
* add threading lock
* apply suggestions from code review
* comments for limit on client
* Update sky/serve/load_balancer.py (Co-authored-by: Zhanghao Wu <[email protected]>)
* Update sky/serve/load_balancer.py (Co-authored-by: Zhanghao Wu <[email protected]>)
* Update sky/serve/load_balancer.py (Co-authored-by: Zhanghao Wu <[email protected]>)
* format
* retry for no replicas as well
* check disconnect if no replicas
* format
* minor
* async probe controller; close clients in the background
* async
* comments
* Update sky/serve/load_balancer.py (Co-authored-by: Zhanghao Wu <[email protected]>)
* format
* fix

---------

Co-authored-by: Zhanghao Wu <[email protected]>
1 parent 8a0a34d commit 5a2f1b8

26 files changed: +348 −131 lines changed
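The gist of the change: the load balancer no longer replies with an HTTP redirect; it proxies each request to a replica itself, retrying failed attempts (see `LB_MAX_RETRY` below) and streaming the response body within a timeout (`LB_STREAM_TIMEOUT`). Below is a minimal sketch of this proxy-with-retry pattern, assuming FastAPI and httpx; the names `proxy` and `REPLICA_URLS` are hypothetical and this is not SkyPilot's actual implementation:

```python
# A minimal sketch of a proxy-with-retry load balancer, assuming FastAPI and
# httpx. Names (`proxy`, `REPLICA_URLS`) are illustrative, not SkyPilot's API.
import fastapi
import httpx
from starlette.background import BackgroundTask
from starlette.responses import StreamingResponse

MAX_RETRY = 3          # cf. LB_MAX_RETRY in sky/serve/constants.py
STREAM_TIMEOUT = 120   # cf. LB_STREAM_TIMEOUT in sky/serve/constants.py
# Hypothetical replica list; in SkyServe the controller supplies this.
REPLICA_URLS = ['http://replica-1:8000', 'http://replica-2:8000']

app = fastapi.FastAPI()


@app.api_route('/{path:path}', methods=['GET', 'POST'])
async def proxy(request: fastapi.Request, path: str) -> StreamingResponse:
    body = await request.body()
    last_error: Exception = RuntimeError('No ready replicas.')
    for attempt in range(MAX_RETRY):
        url = REPLICA_URLS[attempt % len(REPLICA_URLS)]  # naive round robin
        client = httpx.AsyncClient(base_url=url, timeout=STREAM_TIMEOUT)
        try:
            # Stream the response so long generations (e.g. LLM output) are
            # forwarded chunk by chunk instead of buffered whole.
            proxied = client.build_request(request.method, f'/{path}',
                                           content=body)
            response = await client.send(proxied, stream=True)
            # Close the per-request client after the stream finishes.
            return StreamingResponse(response.aiter_raw(),
                                     status_code=response.status_code,
                                     headers=dict(response.headers),
                                     background=BackgroundTask(client.aclose))
        except (httpx.ConnectError, httpx.TimeoutException) as error:
            last_error = error  # replica unreachable or too slow; retry
            await client.aclose()
    raise fastapi.HTTPException(status_code=503, detail=repr(last_error))
```

This also explains the documentation sweep below: with a proxy in front, clients no longer need `curl -L` to follow a redirect.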

docs/source/serving/sky-serve.rst (+8 −14)

@@ -22,7 +22,7 @@ Why SkyServe?
 
 How it works:
 
-- Each service gets an endpoint that automatically redirects requests to its replicas.
+- Each service gets an endpoint that automatically distributes requests to its replicas.
 - Replicas of the same service can run in different regions and clouds — reducing cloud costs and increasing availability.
 - SkyServe handles the load balancing, recovery, and autoscaling of the replicas.
 
@@ -127,7 +127,7 @@ Run :code:`sky serve up service.yaml` to deploy the service with automatic price
 
 If you see the :code:`STATUS` column becomes :code:`READY`, then the service is ready to accept traffic!
 
-Simply ``curl -L`` the service endpoint, which automatically load-balances across the two replicas:
+Simply ``curl`` the service endpoint, which automatically load-balances across the two replicas:
 
 .. tab-set::
 
@@ -136,7 +136,7 @@ Simply ``curl -L`` the service endpoint, which automatically load-balances acros
 
 .. code-block:: console
 
-   $ curl -L 3.84.15.251:30001/v1/chat/completions \
+   $ curl 3.84.15.251:30001/v1/chat/completions \
     -X POST \
     -d '{"model": "mistralai/Mixtral-8x7B-Instruct-v0.1", "messages": [{"role": "user", "content": "Who are you?"}]}' \
     -H 'Content-Type: application/json'
@@ -149,7 +149,7 @@ Simply ``curl -L`` the service endpoint, which automatically load-balances acros
 
 .. code-block:: console
 
-   $ curl -L 44.211.131.51:30001/generate \
+   $ curl 44.211.131.51:30001/generate \
     -X POST \
     -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
     -H 'Content-Type: application/json'
@@ -240,7 +240,7 @@ Under the hood, :code:`sky serve up`:
 #. Launches a controller which handles autoscaling, monitoring and load balancing;
 #. Returns a Service Endpoint which will be used to accept traffic;
 #. Meanwhile, the controller provisions replica VMs which later run the services;
-#. Once any replica is ready, the requests sent to the Service Endpoint will be **HTTP-redirect** to one of the endpoint replicas.
+#. Once any replica is ready, the requests sent to the Service Endpoint will be distributed to one of the endpoint replicas.
 
 After the controller is provisioned, you'll see the following in :code:`sky serve status` output:
 
@@ -264,7 +264,7 @@ sending requests to :code:`<endpoint-url>` (e.g., ``44.201.119.3:30001``):
 
 .. code-block:: console
 
-   $ curl -L <endpoint-url>
+   $ curl <endpoint-url>
    <html>
    <head>
    <title>My First SkyServe Service</title>
@@ -274,12 +274,6 @@ sending requests to :code:`<endpoint-url>` (e.g., ``44.201.119.3:30001``):
    </body>
    </html>
 
-.. note::
-
-  Since we are using HTTP-redirect, we need to use :code:`curl -L
-  <endpoint-url>`. The :code:`curl` command by default won't follow the
-  redirect.
-
 Tutorial: Serve a Chatbot LLM!
 ------------------------------
 
@@ -368,7 +362,7 @@ Send a request using the following cURL command:
 
 .. code-block:: console
 
-   $ curl -L http://<endpoint-url>/v1/chat/completions \
+   $ curl http://<endpoint-url>/v1/chat/completions \
     -X POST \
     -d '{"model":"vicuna-7b-v1.3","messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"Who are you?"}],"temperature":0}' \
     -H 'Content-Type: application/json'
@@ -468,7 +462,7 @@ SkyServe has a centralized controller VM that manages the deployment of your ser
 It is composed of the following components:
 
 #. **Controller**: The controller will monitor the status of the replicas and re-launch a new replica if one of them fails. It also autoscales the number of replicas if autoscaling config is set (see :ref:`Service YAML spec <service-yaml-spec>` for more information).
-#. **Load Balancer**: The load balancer will route the traffic to all ready replicas. It is a lightweight HTTP server that listens on the service endpoint and **HTTP-redirects** the requests to one of the replicas.
+#. **Load Balancer**: The load balancer will route the traffic to all ready replicas. It is a lightweight HTTP server that listens on the service endpoint and distribute the requests to one of the replicas.
 
 All of the process group shares a single controller VM. The controller VM will be launched in the cloud with the best price/performance ratio. You can also :ref:`customize the controller resources <customizing-sky-serve-controller-resources>` based on your needs.

examples/cog/README.md (+1 −1)

@@ -28,7 +28,7 @@ After the service is launched, access the deployment with the following:
 ```console
 ENDPOINT=$(sky serve status --endpoint cog)
 
-curl -L http://$ENDPOINT/predictions -X POST \
+curl http://$ENDPOINT/predictions -X POST \
   -H 'Content-Type: application/json' \
   -d '{"input": {"image": "https://blog.skypilot.co/introducing-sky-serve/images/sky-serve-thumbnail.png"}}' \
   | jq -r '.output | split(",")[1]' | base64 --decode > output.png

examples/serve/misc/cancel/README.md (+2 −2)

@@ -1,6 +1,6 @@
 # SkyServe cancel example
 
-This example demonstrates the redirect support canceling a request.
+This example demonstrates the SkyServe load balancer support canceling a request.
 
 ## Running the example
 
@@ -33,7 +33,7 @@ Client disconnected, stopping computation.
 You can also run
 
 ```bash
-curl -L http://<endpoint>/
+curl http://<endpoint>/
 ```
 
 and manually Ctrl + C to cancel the request and see logs.
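Proxying is what lets cancellation propagate: when the client disconnects, the load balancer can stop forwarding and the replica can stop computing. A hedged sketch of the disconnect check on the serving side, assuming FastAPI/Starlette; this is illustrative, not the example's actual code:

```python
# Illustrative sketch of detecting a canceled request, matching the example's
# "Client disconnected, stopping computation." log. Assumes FastAPI/Starlette.
import asyncio

import fastapi

app = fastapi.FastAPI()


@app.get('/')
async def compute(request: fastapi.Request):
    result = 0
    for i in range(10_000_000):
        if i % 100_000 == 0:
            # Periodically poll for disconnects so a Ctrl+C on the curl
            # side stops the work instead of running to completion.
            if await request.is_disconnected():
                print('Client disconnected, stopping computation.')
                # 499 is nginx's non-standard "client closed request" code.
                return fastapi.Response(status_code=499)
            await asyncio.sleep(0)  # yield to the event loop
        result += i
    return {'result': result}
```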

examples/serve/stable_diffusion_service.yaml (+1 −1)

@@ -18,7 +18,7 @@ file_mounts:
   /stable_diffusion: examples/stable_diffusion
 
 setup: |
-  sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
+  sudo curl "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
   sudo chmod +x /usr/local/bin/docker-compose
   cd stable-diffusion-webui-docker
   sudo rm -r stable-diffusion-webui-docker

examples/stable_diffusion/stable_diffusion_docker.yaml (+1 −1)

@@ -7,7 +7,7 @@ file_mounts:
   /stable_diffusion: .
 
 setup: |
-  sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
+  sudo curl "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
   sudo chmod +x /usr/local/bin/docker-compose
   cd stable-diffusion-webui-docker
   sudo rm -r stable-diffusion-webui-docker

llm/codellama/README.md (+3 −3)

@@ -68,7 +68,7 @@ Launching a cluster 'code-llama'. Proceed? [Y/n]:
 ```bash
 IP=$(sky status --ip code-llama)
 
-curl -L http://$IP:8000/v1/completions \
+curl http://$IP:8000/v1/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "codellama/CodeLlama-70b-Instruct-hf",
@@ -131,7 +131,7 @@ availability of the service while minimizing the cost.
 ```bash
 ENDPOINT=$(sky serve status --endpoint code-llama)
 
-curl -L http://$ENDPOINT/v1/completions \
+curl http://$ENDPOINT/v1/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "codellama/CodeLlama-70b-Instruct-hf",
@@ -146,7 +146,7 @@ We can also access the Code Llama service with the openAI Chat API.
 ```bash
 ENDPOINT=$(sky serve status --endpoint code-llama)
 
-curl -L http://$ENDPOINT/v1/chat/completions \
+curl http://$ENDPOINT/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "codellama/CodeLlama-70b-Instruct-hf",

llm/dbrx/README.md (+1 −1)

@@ -256,7 +256,7 @@ ENDPOINT=$(sky serve status --endpoint dbrx)
 
 To curl the endpoint:
 ```console
-curl -L $ENDPOINT/v1/chat/completions \
+curl $ENDPOINT/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "databricks/dbrx-instruct",

llm/gemma/README.md (+4 −4)

@@ -37,7 +37,7 @@ After the cluster is launched, we can access the model with the following comman
 ```bash
 IP=$(sky status --ip gemma)
 
-curl -L http://$IP:8000/v1/completions \
+curl http://$IP:8000/v1/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "google/gemma-7b-it",
@@ -50,7 +50,7 @@ Chat API is also supported:
 ```bash
 IP=$(sky status --ip gemma)
 
-curl -L http://$IP:8000/v1/chat/completions \
+curl http://$IP:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "google/gemma-7b-it",
@@ -78,7 +78,7 @@ After the cluster is launched, we can access the model with the following comman
 ```bash
 ENDPOINT=$(sky serve status --endpoint gemma)
 
-curl -L http://$ENDPOINT/v1/completions \
+curl http://$ENDPOINT/v1/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "google/gemma-7b-it",
@@ -89,7 +89,7 @@ curl -L http://$ENDPOINT/v1/completions \
 
 Chat API is also supported:
 ```bash
-curl -L http://$ENDPOINT/v1/chat/completions \
+curl http://$ENDPOINT/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "google/gemma-7b-it",

llm/mixtral/README.md (+4 −4)

@@ -53,7 +53,7 @@ We can now access the model through the OpenAI API with the IP and port:
 ```bash
 IP=$(sky status --ip mixtral)
 
-curl -L http://$IP:8000/v1/completions \
+curl http://$IP:8000/v1/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
@@ -66,7 +66,7 @@ Chat API is also supported:
 ```bash
 IP=$(sky status --ip mixtral)
 
-curl -L http://$IP:8000/v1/chat/completions \
+curl http://$IP:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
@@ -119,7 +119,7 @@ After the `sky serve up` command, there will be a single endpoint for the servic
 ```bash
 ENDPOINT=$(sky serve status --endpoint mixtral)
 
-curl -L http://$ENDPOINT/v1/completions \
+curl http://$ENDPOINT/v1/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
@@ -132,7 +132,7 @@ Chat API is also supported:
 ```bash
 ENDPOINT=$(sky serve status --endpoint mixtral)
 
-curl -L http://$ENDPOINT/v1/chat/completions \
+curl http://$ENDPOINT/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",

llm/qwen/README.md (+4 −4)

@@ -34,7 +34,7 @@ sky launch -c qwen serve-110b.yaml
 ```bash
 IP=$(sky status --ip qwen)
 
-curl -L http://$IP:8000/v1/completions \
+curl http://$IP:8000/v1/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "Qwen/Qwen1.5-110B-Chat",
@@ -45,7 +45,7 @@ curl -L http://$IP:8000/v1/completions \
 
 3. Send a request for chat completion:
 ```bash
-curl -L http://$IP:8000/v1/chat/completions \
+curl http://$IP:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "Qwen/Qwen1.5-110B-Chat",
@@ -92,11 +92,11 @@ As shown, the service is now backed by 2 replicas, one on Azure and one on GCP,
 type is chosen to be **the cheapest available one** on the clouds. That said, it maximizes the
 availability of the service while minimizing the cost.
 
-3. To access the model, we use a `curl -L` command (`-L` to follow redirect) to send the request to the endpoint:
+3. To access the model, we use a `curl` command to send the request to the endpoint:
 ```bash
 ENDPOINT=$(sky serve status --endpoint qwen)
 
-curl -L http://$ENDPOINT/v1/chat/completions \
+curl http://$ENDPOINT/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "Qwen/Qwen1.5-72B-Chat",

llm/sglang/README.md (+2 −2)

@@ -68,7 +68,7 @@ ENDPOINT=$(sky serve status --endpoint sglang-llava)
 </figure>
 
 ```bash
-curl -L $ENDPOINT/v1/chat/completions \
+curl $ENDPOINT/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "liuhaotian/llava-v1.6-vicuna-7b",
@@ -149,7 +149,7 @@ ENDPOINT=$(sky serve status --endpoint sglang-llama2)
 4. Once it status is `READY`, you can use the endpoint to interact with the model:
 
 ```bash
-curl -L $ENDPOINT/v1/chat/completions \
+curl $ENDPOINT/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "meta-llama/Llama-2-7b-chat-hf",

llm/tgi/README.md (+2 −2)

@@ -17,7 +17,7 @@ A user can access the model with the following command:
 ```bash
 ENDPOINT=$(sky status --endpoint 8080 tgi)
 
-curl -L $(sky serve status tgi --endpoint)/generate \
+curl $(sky serve status tgi --endpoint)/generate \
   -H 'Content-Type: application/json' \
   -d '{
     "inputs": "What is Deep Learning?",
@@ -51,7 +51,7 @@ After the service is launched, we can access the model with the following comman
 ```bash
 ENDPOINT=$(sky serve status --endpoint tgi)
 
-curl -L $ENDPOINT/generate \
+curl $ENDPOINT/generate \
   -H 'Content-Type: application/json' \
   -d '{
     "inputs": "What is Deep Learning?",

llm/vllm/README.md (+2 −2)

@@ -154,7 +154,7 @@ ENDPOINT=$(sky serve status --endpoint vllm-llama2)
 4. Once it status is `READY`, you can use the endpoint to interact with the model:
 
 ```bash
-curl -L $ENDPOINT/v1/chat/completions \
+curl $ENDPOINT/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "meta-llama/Llama-2-7b-chat-hf",
@@ -171,7 +171,7 @@ curl -L $ENDPOINT/v1/chat/completions \
 }'
 ```
 
-Notice that it is the same with previously curl command, except for thr `-L` argument. You should get a similar response as the following:
+Notice that it is the same with previously curl command. You should get a similar response as the following:
 
 ```console
 {

sky/serve/README.md (+3 −3)

@@ -2,7 +2,7 @@
 
 Serving library for SkyPilot.
 
-The goal of Sky Serve is simple - expose one endpoint, that redirects to serving endpoints running on different resources, regions and clouds.
+The goal of Sky Serve is simple - exposing one endpoint, that distributes any incoming traffic to serving endpoints running on different resources, regions, and clouds.
 
 Sky Serve transparently handles load balancing, failover and autoscaling of the serving endpoints.
 
@@ -11,8 +11,8 @@ Sky Serve transparently handles load balancing, failover and autoscaling of the
 ![Architecture](../../docs/source/images/sky-serve-architecture.png)
 
 Sky Serve has four key components:
-1. Redirector - receiving requests and redirecting them to healthy endpoints.
-2. Load balancers - spread requests across healthy endpoints according to different policies.
+1. Load Balancers - receiving requests and distributing them to healthy endpoints.
+2. Load Balancing Policies - spread requests across healthy endpoints according to different policies.
 3. Autoscalers - scale up and down the number of serving endpoints according to different policies.
 4. Replica Managers - monitoring replica status and handle recovery of unhealthy endpoints.
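To make the renamed components concrete, here is a minimal sketch of what a load balancing policy (component 2) might look like; the class and method names are hypothetical, not SkyPilot's actual classes. The lock echoes the "add threading lock" item in the commit message:

```python
# Hypothetical round-robin load balancing policy; illustrative only.
import threading
from typing import List, Optional


class RoundRobinPolicy:
    """Cycle through ready replica URLs; thread-safe."""

    def __init__(self) -> None:
        self._replicas: List[str] = []
        self._index = 0
        self._lock = threading.Lock()

    def set_ready_replicas(self, replicas: List[str]) -> None:
        # Reset the cursor whenever the ready set changes.
        with self._lock:
            if replicas != self._replicas:
                self._replicas = list(replicas)
                self._index = 0

    def select_replica(self) -> Optional[str]:
        with self._lock:
            if not self._replicas:
                return None  # caller may retry until a replica turns ready
            replica = self._replicas[self._index % len(self._replicas)]
            self._index += 1
            return replica
```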

sky/serve/constants.py (+12)

@@ -21,6 +21,18 @@
 # interval.
 LB_CONTROLLER_SYNC_INTERVAL_SECONDS = 20
 
+# The maximum retry times for load balancer for each request. After changing to
+# proxy implementation, we do retry for failed requests.
+# TODO(tian): Expose this option to users in yaml file.
+LB_MAX_RETRY = 3
+
+# The timeout in seconds for load balancer to wait for a response from replica.
+# Large LLMs like Llama2-70b is able to process the request within ~30 seconds.
+# We set the timeout to 120s to be safe. For reference, FastChat uses 100s:
+# https://github.com/lm-sys/FastChat/blob/f2e6ca964af7ad0585cadcf16ab98e57297e2133/fastchat/constants.py#L39 # pylint: disable=line-too-long
+# TODO(tian): Expose this option to users in yaml file.
+LB_STREAM_TIMEOUT = 120
+
 # Interval in seconds to probe replica endpoint.
 ENDPOINT_PROBE_INTERVAL_SECONDS = 10
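A small hedged sketch of how these constants might be consumed when creating per-replica HTTP clients (illustrative; the actual usage lives in sky/serve/load_balancer.py, and httpx is an assumption here):

```python
# Illustrative only: how LB_STREAM_TIMEOUT could bound a proxied request.
# The commit message's "close clients in the background" suggests clients
# are long-lived and disposed of asynchronously.
import httpx

from sky.serve import constants


def make_replica_client(replica_url: str) -> httpx.AsyncClient:
    # One async client per replica; each proxied request through it must
    # complete within LB_STREAM_TIMEOUT seconds.
    return httpx.AsyncClient(base_url=replica_url,
                             timeout=constants.LB_STREAM_TIMEOUT)
```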

sky/serve/core.py (+1 −1)

@@ -285,7 +285,7 @@ def up(
         f'{backend_utils.BOLD}watch -n10 sky serve status {service_name}'
         f'{backend_utils.RESET_BOLD}'
         '\nTo send a test request:\t\t'
-        f'{backend_utils.BOLD}curl -L {endpoint}'
+        f'{backend_utils.BOLD}curl {endpoint}'
         f'{backend_utils.RESET_BOLD}'
         '\n'
         f'\n{fore.GREEN}SkyServe is spinning up your service now.'

0 commit comments