
Commit 5a2f1b8

cblmemo and Michaelvll authored
[Serve] Proxy w/ retry (#3395)
[Serve] Proxy w/ retry (#3395)

* init
* support streaming
* max retry num
* upd comments
* remove -L in documentations
* streaming smoke test. TODO: debug and make sure it works
* Apply suggestions from code review (Co-authored-by: Zhanghao Wu <[email protected]>)
* comments and expose exceptions in smoke test
* upd smoke test and passed
* timeout
* yield error
* remove -L
* apply suggestions from code review
* add threading lock
* apply suggestions from code review
* comments for limit on client
* Update sky/serve/load_balancer.py (Co-authored-by: Zhanghao Wu <[email protected]>)
* Update sky/serve/load_balancer.py (Co-authored-by: Zhanghao Wu <[email protected]>)
* Update sky/serve/load_balancer.py (Co-authored-by: Zhanghao Wu <[email protected]>)
* format
* retry for no replicas as well
* check disconnect if no replicas
* format
* minor
* async probe controller; close clients in the background
* async
* comments
* Update sky/serve/load_balancer.py (Co-authored-by: Zhanghao Wu <[email protected]>)
* format
* fix

---------

Co-authored-by: Zhanghao Wu <[email protected]>
1 parent 8a0a34d commit 5a2f1b8

26 files changed: +348 −131 lines changed
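The gist of the change: the load balancer no longer replies with an HTTP redirect; it proxies each request to a replica itself, retrying failed attempts (see `LB_MAX_RETRY` below) and streaming the response body within a timeout (`LB_STREAM_TIMEOUT`). Below is a minimal sketch of this proxy-with-retry pattern, assuming FastAPI and httpx; the names `proxy` and `REPLICA_URLS` are hypothetical and this is not SkyPilot's actual implementation:

```python
# A minimal sketch of a proxy-with-retry load balancer, assuming FastAPI and
# httpx. Names (`proxy`, `REPLICA_URLS`) are illustrative, not SkyPilot's API.
import fastapi
import httpx
from starlette.background import BackgroundTask
from starlette.responses import StreamingResponse

MAX_RETRY = 3          # cf. LB_MAX_RETRY in sky/serve/constants.py
STREAM_TIMEOUT = 120   # cf. LB_STREAM_TIMEOUT in sky/serve/constants.py
# Hypothetical replica list; in SkyServe the controller supplies this.
REPLICA_URLS = ['http://replica-1:8000', 'http://replica-2:8000']

app = fastapi.FastAPI()


@app.api_route('/{path:path}', methods=['GET', 'POST'])
async def proxy(request: fastapi.Request, path: str) -> StreamingResponse:
    body = await request.body()
    last_error: Exception = RuntimeError('No ready replicas.')
    for attempt in range(MAX_RETRY):
        url = REPLICA_URLS[attempt % len(REPLICA_URLS)]  # naive round robin
        client = httpx.AsyncClient(base_url=url, timeout=STREAM_TIMEOUT)
        try:
            # Stream the response so long generations (e.g. LLM output) are
            # forwarded chunk by chunk instead of buffered whole.
            proxied = client.build_request(request.method, f'/{path}',
                                           content=body)
            response = await client.send(proxied, stream=True)
            # Close the per-request client after the stream finishes.
            return StreamingResponse(response.aiter_raw(),
                                     status_code=response.status_code,
                                     headers=dict(response.headers),
                                     background=BackgroundTask(client.aclose))
        except (httpx.ConnectError, httpx.TimeoutException) as error:
            last_error = error  # replica unreachable or too slow; retry
            await client.aclose()
    raise fastapi.HTTPException(status_code=503, detail=repr(last_error))
```

This also explains the documentation sweep below: with a proxy in front, clients no longer need `curl -L` to follow a redirect.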

docs/source/serving/sky-serve.rst (+8 −14)

@@ -22,7 +22,7 @@ Why SkyServe?
 
 How it works:
 
-- Each service gets an endpoint that automatically redirects requests to its replicas.
+- Each service gets an endpoint that automatically distributes requests to its replicas.
 - Replicas of the same service can run in different regions and clouds — reducing cloud costs and increasing availability.
 - SkyServe handles the load balancing, recovery, and autoscaling of the replicas.
 
@@ -127,7 +127,7 @@ Run :code:`sky serve up service.yaml` to deploy the service with automatic price
 
 If you see the :code:`STATUS` column becomes :code:`READY`, then the service is ready to accept traffic!
 
-Simply ``curl -L`` the service endpoint, which automatically load-balances across the two replicas:
+Simply ``curl`` the service endpoint, which automatically load-balances across the two replicas:
 
 .. tab-set::
 
@@ -136,7 +136,7 @@ Simply ``curl -L`` the service endpoint, which automatically load-balances acros
 
 .. code-block:: console
 
-   $ curl -L 3.84.15.251:30001/v1/chat/completions \
+   $ curl 3.84.15.251:30001/v1/chat/completions \
     -X POST \
     -d '{"model": "mistralai/Mixtral-8x7B-Instruct-v0.1", "messages": [{"role": "user", "content": "Who are you?"}]}' \
     -H 'Content-Type: application/json'
@@ -149,7 +149,7 @@ Simply ``curl -L`` the service endpoint, which automatically load-balances acros
 
 .. code-block:: console
 
-   $ curl -L 44.211.131.51:30001/generate \
+   $ curl 44.211.131.51:30001/generate \
     -X POST \
     -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
     -H 'Content-Type: application/json'
@@ -240,7 +240,7 @@ Under the hood, :code:`sky serve up`:
 #. Launches a controller which handles autoscaling, monitoring and load balancing;
 #. Returns a Service Endpoint which will be used to accept traffic;
 #. Meanwhile, the controller provisions replica VMs which later run the services;
-#. Once any replica is ready, the requests sent to the Service Endpoint will be **HTTP-redirect** to one of the endpoint replicas.
+#. Once any replica is ready, the requests sent to the Service Endpoint will be distributed to one of the endpoint replicas.
 
 After the controller is provisioned, you'll see the following in :code:`sky serve status` output:
 
@@ -264,7 +264,7 @@ sending requests to :code:`<endpoint-url>` (e.g., ``44.201.119.3:30001``):
 
 .. code-block:: console
 
-   $ curl -L <endpoint-url>
+   $ curl <endpoint-url>
    <html>
    <head>
    <title>My First SkyServe Service</title>
@@ -274,12 +274,6 @@ sending requests to :code:`<endpoint-url>` (e.g., ``44.201.119.3:30001``):
    </body>
    </html>
 
-.. note::
-
-  Since we are using HTTP-redirect, we need to use :code:`curl -L
-  <endpoint-url>`. The :code:`curl` command by default won't follow the
-  redirect.
-
 Tutorial: Serve a Chatbot LLM!
 ------------------------------
 
@@ -368,7 +362,7 @@ Send a request using the following cURL command:
 
 .. code-block:: console
 
-   $ curl -L http://<endpoint-url>/v1/chat/completions \
+   $ curl http://<endpoint-url>/v1/chat/completions \
     -X POST \
     -d '{"model":"vicuna-7b-v1.3","messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"Who are you?"}],"temperature":0}' \
     -H 'Content-Type: application/json'
@@ -468,7 +462,7 @@ SkyServe has a centralized controller VM that manages the deployment of your ser
 It is composed of the following components:
 
 #. **Controller**: The controller will monitor the status of the replicas and re-launch a new replica if one of them fails. It also autoscales the number of replicas if autoscaling config is set (see :ref:`Service YAML spec <service-yaml-spec>` for more information).
-#. **Load Balancer**: The load balancer will route the traffic to all ready replicas. It is a lightweight HTTP server that listens on the service endpoint and **HTTP-redirects** the requests to one of the replicas.
+#. **Load Balancer**: The load balancer will route the traffic to all ready replicas. It is a lightweight HTTP server that listens on the service endpoint and distribute the requests to one of the replicas.
 
 All of the process group shares a single controller VM. The controller VM will be launched in the cloud with the best price/performance ratio. You can also :ref:`customize the controller resources <customizing-sky-serve-controller-resources>` based on your needs.

examples/cog/README.md (+1 −1)

@@ -28,7 +28,7 @@ After the service is launched, access the deployment with the following:
 ```console
 ENDPOINT=$(sky serve status --endpoint cog)
 
-curl -L http://$ENDPOINT/predictions -X POST \
+curl http://$ENDPOINT/predictions -X POST \
   -H 'Content-Type: application/json' \
   -d '{"input": {"image": "https://blog.skypilot.co/introducing-sky-serve/images/sky-serve-thumbnail.png"}}' \
   | jq -r '.output | split(",")[1]' | base64 --decode > output.png

examples/serve/misc/cancel/README.md (+2 −2)

@@ -1,6 +1,6 @@
 # SkyServe cancel example
 
-This example demonstrates the redirect support canceling a request.
+This example demonstrates the SkyServe load balancer support canceling a request.
 
 ## Running the example
 
@@ -33,7 +33,7 @@ Client disconnected, stopping computation.
 You can also run
 
 ```bash
-curl -L http://<endpoint>/
+curl http://<endpoint>/
 ```
 
 and manually Ctrl + C to cancel the request and see logs.
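Proxying is what lets cancellation propagate: when the client disconnects, the load balancer can stop forwarding and the replica can stop computing. A hedged sketch of the disconnect check on the serving side, assuming FastAPI/Starlette; this is illustrative, not the example's actual code:

```python
# Illustrative sketch of detecting a canceled request, matching the example's
# "Client disconnected, stopping computation." log. Assumes FastAPI/Starlette.
import asyncio

import fastapi

app = fastapi.FastAPI()


@app.get('/')
async def compute(request: fastapi.Request):
    result = 0
    for i in range(10_000_000):
        if i % 100_000 == 0:
            # Periodically poll for disconnects so a Ctrl+C on the curl
            # side stops the work instead of running to completion.
            if await request.is_disconnected():
                print('Client disconnected, stopping computation.')
                # 499 is nginx's non-standard "client closed request" code.
                return fastapi.Response(status_code=499)
            await asyncio.sleep(0)  # yield to the event loop
        result += i
    return {'result': result}
```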

examples/serve/stable_diffusion_service.yaml (+1 −1)

@@ -18,7 +18,7 @@ file_mounts:
   /stable_diffusion: examples/stable_diffusion
 
 setup: |
-  sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
+  sudo curl "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
   sudo chmod +x /usr/local/bin/docker-compose
   cd stable-diffusion-webui-docker
   sudo rm -r stable-diffusion-webui-docker

examples/stable_diffusion/stable_diffusion_docker.yaml (+1 −1)

@@ -7,7 +7,7 @@ file_mounts:
   /stable_diffusion: .
 
 setup: |
-  sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
+  sudo curl "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
   sudo chmod +x /usr/local/bin/docker-compose
   cd stable-diffusion-webui-docker
   sudo rm -r stable-diffusion-webui-docker

llm/codellama/README.md (+3 −3)

@@ -68,7 +68,7 @@ Launching a cluster 'code-llama'. Proceed? [Y/n]:
 ```bash
 IP=$(sky status --ip code-llama)
 
-curl -L http://$IP:8000/v1/completions \
+curl http://$IP:8000/v1/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "codellama/CodeLlama-70b-Instruct-hf",
@@ -131,7 +131,7 @@ availability of the service while minimizing the cost.
 ```bash
 ENDPOINT=$(sky serve status --endpoint code-llama)
 
-curl -L http://$ENDPOINT/v1/completions \
+curl http://$ENDPOINT/v1/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "codellama/CodeLlama-70b-Instruct-hf",
@@ -146,7 +146,7 @@ We can also access the Code Llama service with the openAI Chat API.
 ```bash
 ENDPOINT=$(sky serve status --endpoint code-llama)
 
-curl -L http://$ENDPOINT/v1/chat/completions \
+curl http://$ENDPOINT/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "codellama/CodeLlama-70b-Instruct-hf",

llm/dbrx/README.md (+1 −1)

@@ -256,7 +256,7 @@ ENDPOINT=$(sky serve status --endpoint dbrx)
 
 To curl the endpoint:
 ```console
-curl -L $ENDPOINT/v1/chat/completions \
+curl $ENDPOINT/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "databricks/dbrx-instruct",

llm/gemma/README.md (+4 −4)

@@ -37,7 +37,7 @@ After the cluster is launched, we can access the model with the following comman
 ```bash
 IP=$(sky status --ip gemma)
 
-curl -L http://$IP:8000/v1/completions \
+curl http://$IP:8000/v1/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "google/gemma-7b-it",
@@ -50,7 +50,7 @@ Chat API is also supported:
 ```bash
 IP=$(sky status --ip gemma)
 
-curl -L http://$IP:8000/v1/chat/completions \
+curl http://$IP:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "google/gemma-7b-it",
@@ -78,7 +78,7 @@ After the cluster is launched, we can access the model with the following comman
 ```bash
 ENDPOINT=$(sky serve status --endpoint gemma)
 
-curl -L http://$ENDPOINT/v1/completions \
+curl http://$ENDPOINT/v1/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "google/gemma-7b-it",
@@ -89,7 +89,7 @@ curl -L http://$ENDPOINT/v1/completions \
 
 Chat API is also supported:
 ```bash
-curl -L http://$ENDPOINT/v1/chat/completions \
+curl http://$ENDPOINT/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "google/gemma-7b-it",

llm/mixtral/README.md (+4 −4)

@@ -53,7 +53,7 @@ We can now access the model through the OpenAI API with the IP and port:
 ```bash
 IP=$(sky status --ip mixtral)
 
-curl -L http://$IP:8000/v1/completions \
+curl http://$IP:8000/v1/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
@@ -66,7 +66,7 @@ Chat API is also supported:
 ```bash
 IP=$(sky status --ip mixtral)
 
-curl -L http://$IP:8000/v1/chat/completions \
+curl http://$IP:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
@@ -119,7 +119,7 @@ After the `sky serve up` command, there will be a single endpoint for the servic
 ```bash
 ENDPOINT=$(sky serve status --endpoint mixtral)
 
-curl -L http://$ENDPOINT/v1/completions \
+curl http://$ENDPOINT/v1/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
@@ -132,7 +132,7 @@ Chat API is also supported:
 ```bash
 ENDPOINT=$(sky serve status --endpoint mixtral)
 
-curl -L http://$ENDPOINT/v1/chat/completions \
+curl http://$ENDPOINT/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",

llm/qwen/README.md (+4 −4)

@@ -34,7 +34,7 @@ sky launch -c qwen serve-110b.yaml
 ```bash
 IP=$(sky status --ip qwen)
 
-curl -L http://$IP:8000/v1/completions \
+curl http://$IP:8000/v1/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "Qwen/Qwen1.5-110B-Chat",
@@ -45,7 +45,7 @@ curl -L http://$IP:8000/v1/completions \
 
 3. Send a request for chat completion:
 ```bash
-curl -L http://$IP:8000/v1/chat/completions \
+curl http://$IP:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "Qwen/Qwen1.5-110B-Chat",
@@ -92,11 +92,11 @@ As shown, the service is now backed by 2 replicas, one on Azure and one on GCP,
 type is chosen to be **the cheapest available one** on the clouds. That said, it maximizes the
 availability of the service while minimizing the cost.
 
-3. To access the model, we use a `curl -L` command (`-L` to follow redirect) to send the request to the endpoint:
+3. To access the model, we use a `curl` command to send the request to the endpoint:
 ```bash
 ENDPOINT=$(sky serve status --endpoint qwen)
 
-curl -L http://$ENDPOINT/v1/chat/completions \
+curl http://$ENDPOINT/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "Qwen/Qwen1.5-72B-Chat",

llm/sglang/README.md (+2 −2)

@@ -68,7 +68,7 @@ ENDPOINT=$(sky serve status --endpoint sglang-llava)
 </figure>
 
 ```bash
-curl -L $ENDPOINT/v1/chat/completions \
+curl $ENDPOINT/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "liuhaotian/llava-v1.6-vicuna-7b",
@@ -149,7 +149,7 @@ ENDPOINT=$(sky serve status --endpoint sglang-llama2)
 4. Once it status is `READY`, you can use the endpoint to interact with the model:
 
 ```bash
-curl -L $ENDPOINT/v1/chat/completions \
+curl $ENDPOINT/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "meta-llama/Llama-2-7b-chat-hf",

llm/tgi/README.md (+2 −2)

@@ -17,7 +17,7 @@ A user can access the model with the following command:
 ```bash
 ENDPOINT=$(sky status --endpoint 8080 tgi)
 
-curl -L $(sky serve status tgi --endpoint)/generate \
+curl $(sky serve status tgi --endpoint)/generate \
   -H 'Content-Type: application/json' \
   -d '{
     "inputs": "What is Deep Learning?",
@@ -51,7 +51,7 @@ After the service is launched, we can access the model with the following comman
 ```bash
 ENDPOINT=$(sky serve status --endpoint tgi)
 
-curl -L $ENDPOINT/generate \
+curl $ENDPOINT/generate \
   -H 'Content-Type: application/json' \
   -d '{
     "inputs": "What is Deep Learning?",

llm/vllm/README.md (+2 −2)

@@ -154,7 +154,7 @@ ENDPOINT=$(sky serve status --endpoint vllm-llama2)
 4. Once it status is `READY`, you can use the endpoint to interact with the model:
 
 ```bash
-curl -L $ENDPOINT/v1/chat/completions \
+curl $ENDPOINT/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "meta-llama/Llama-2-7b-chat-hf",
@@ -171,7 +171,7 @@ curl -L $ENDPOINT/v1/chat/completions \
 }'
 ```
 
-Notice that it is the same with previously curl command, except for thr `-L` argument. You should get a similar response as the following:
+Notice that it is the same with previously curl command. You should get a similar response as the following:
 
 ```console
 {

sky/serve/README.md (+3 −3)

@@ -2,7 +2,7 @@
 
 Serving library for SkyPilot.
 
-The goal of Sky Serve is simple - expose one endpoint, that redirects to serving endpoints running on different resources, regions and clouds.
+The goal of Sky Serve is simple - exposing one endpoint, that distributes any incoming traffic to serving endpoints running on different resources, regions, and clouds.
 
 Sky Serve transparently handles load balancing, failover and autoscaling of the serving endpoints.
 
@@ -11,8 +11,8 @@ Sky Serve transparently handles load balancing, failover and autoscaling of the
 ![Architecture](../../docs/source/images/sky-serve-architecture.png)
 
 Sky Serve has four key components:
-1. Redirector - receiving requests and redirecting them to healthy endpoints.
-2. Load balancers - spread requests across healthy endpoints according to different policies.
+1. Load Balancers - receiving requests and distributing them to healthy endpoints.
+2. Load Balancing Policies - spread requests across healthy endpoints according to different policies.
 3. Autoscalers - scale up and down the number of serving endpoints according to different policies.
 4. Replica Managers - monitoring replica status and handle recovery of unhealthy endpoints.
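To make the renamed components concrete, here is a minimal sketch of what a load balancing policy (component 2) might look like; the class and method names are hypothetical, not SkyPilot's actual classes. The lock echoes the "add threading lock" item in the commit message:

```python
# Hypothetical round-robin load balancing policy; illustrative only.
import threading
from typing import List, Optional


class RoundRobinPolicy:
    """Cycle through ready replica URLs; thread-safe."""

    def __init__(self) -> None:
        self._replicas: List[str] = []
        self._index = 0
        self._lock = threading.Lock()

    def set_ready_replicas(self, replicas: List[str]) -> None:
        # Reset the cursor whenever the ready set changes.
        with self._lock:
            if replicas != self._replicas:
                self._replicas = list(replicas)
                self._index = 0

    def select_replica(self) -> Optional[str]:
        with self._lock:
            if not self._replicas:
                return None  # caller may retry until a replica turns ready
            replica = self._replicas[self._index % len(self._replicas)]
            self._index += 1
            return replica
```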

sky/serve/constants.py (+12)

@@ -21,6 +21,18 @@
 # interval.
 LB_CONTROLLER_SYNC_INTERVAL_SECONDS = 20
 
+# The maximum retry times for load balancer for each request. After changing to
+# proxy implementation, we do retry for failed requests.
+# TODO(tian): Expose this option to users in yaml file.
+LB_MAX_RETRY = 3
+
+# The timeout in seconds for load balancer to wait for a response from replica.
+# Large LLMs like Llama2-70b is able to process the request within ~30 seconds.
+# We set the timeout to 120s to be safe. For reference, FastChat uses 100s:
+# https://github.com/lm-sys/FastChat/blob/f2e6ca964af7ad0585cadcf16ab98e57297e2133/fastchat/constants.py#L39 # pylint: disable=line-too-long
+# TODO(tian): Expose this option to users in yaml file.
+LB_STREAM_TIMEOUT = 120
+
 # Interval in seconds to probe replica endpoint.
 ENDPOINT_PROBE_INTERVAL_SECONDS = 10
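A small hedged sketch of how these constants might be consumed when creating per-replica HTTP clients (illustrative; the actual usage lives in sky/serve/load_balancer.py, and httpx is an assumption here):

```python
# Illustrative only: how LB_STREAM_TIMEOUT could bound a proxied request.
# The commit message's "close clients in the background" suggests clients
# are long-lived and disposed of asynchronously.
import httpx

from sky.serve import constants


def make_replica_client(replica_url: str) -> httpx.AsyncClient:
    # One async client per replica; each proxied request through it must
    # complete within LB_STREAM_TIMEOUT seconds.
    return httpx.AsyncClient(base_url=replica_url,
                             timeout=constants.LB_STREAM_TIMEOUT)
```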

sky/serve/core.py (+1 −1)

@@ -285,7 +285,7 @@ def up(
         f'{backend_utils.BOLD}watch -n10 sky serve status {service_name}'
         f'{backend_utils.RESET_BOLD}'
         '\nTo send a test request:\t\t'
-        f'{backend_utils.BOLD}curl -L {endpoint}'
+        f'{backend_utils.BOLD}curl {endpoint}'
         f'{backend_utils.RESET_BOLD}'
         '\n'
         f'\n{fore.GREEN}SkyServe is spinning up your service now.'

0 commit comments