
Commit b272a63

Add prod recommendations and migrating guide (#2334)
1 parent 95cc64c commit b272a63

24 files changed: +435 -246 lines

docs/clients/install.md

+11 -11

@@ -1,6 +1,16 @@
 # Install

-## Install with pip
+## Install the CLI
+
+<!-- CORTEX_VERSION_README x2 -->
+```bash
+# download CLI version 0.38.0 (Note the "v"):
+bash -c "$(curl -sS https://github.com/raw/cortexlabs/cortex/v0.38.0/get-cli.sh)"
+```
+
+By default, the Cortex CLI is installed at `/usr/local/bin/cortex`. To install the executable elsewhere, export the `CORTEX_INSTALL_PATH` environment variable to your desired location before running the command above.
+
+## Install the CLI and Python client via pip

 To install the latest version:

@@ -21,16 +31,6 @@ To upgrade to the latest version:
 pip install --upgrade cortex
 ```

-## Install without the Python client
-
-<!-- CORTEX_VERSION_README x2 -->
-```bash
-# For example to download CLI version 0.38.0 (Note the "v"):
-bash -c "$(curl -sS https://github.com/raw/cortexlabs/cortex/v0.38.0/get-cli.sh)"
-```
-
-By default, the Cortex CLI is installed at `/usr/local/bin/cortex`. To install the executable elsewhere, export the `CORTEX_INSTALL_PATH` environment variable to your desired location before running the command above.
-
 ## Changing the CLI/client configuration directory

 By default, the CLI/client creates a directory at `~/.cortex/` and uses it to store environment configuration. To use a different directory, export the `CORTEX_CLI_CONFIG_DIR` environment variable before running any `cortex` commands.
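As an illustration of the two environment variables above, a minimal shell session might look like this (the paths are placeholders, not recommendations):

```bash
# install the CLI to a custom path instead of /usr/local/bin/cortex (example path)
export CORTEX_INSTALL_PATH="$HOME/bin/cortex"
bash -c "$(curl -sS https://github.com/raw/cortexlabs/cortex/v0.38.0/get-cli.sh)"

# keep environment configuration outside of ~/.cortex/ (example path)
export CORTEX_CLI_CONFIG_DIR="$HOME/.config/cortex"
"$HOME/bin/cortex" version
```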

docs/clusters/instances/spot.md

+1 -1

@@ -17,7 +17,7 @@ node_groups:
 on_demand_base_capacity: 0

 # percentage of on demand instances to use after the on demand base capacity has been met [0, 100] (default: 50)
-# note: setting this to 0 may hinder cluster scale up when spot instances are not available
+# note: setting this to 0 may hinder cluster scale-up when spot instances are not available
 on_demand_percentage_above_base_capacity: 0

 # max price for spot instances (default: the on-demand price of the primary instance type)

docs/clusters/management/create.md

+3 -2

@@ -9,9 +9,10 @@

 ## Create a cluster on your AWS account

+<!-- CORTEX_VERSION_README -->
 ```bash
-# install the CLI
-pip install cortex
+# install the cortex CLI
+bash -c "$(curl -sS https://github.com/raw/cortexlabs/cortex/v0.38.0/get-cli.sh)"

 # create a cluster
 cortex cluster up cluster.yaml

docs/clusters/management/delete.md

+6 -3

@@ -8,10 +8,13 @@ cortex cluster down

 When a Cortex cluster is created, an S3 bucket is created for its internal use. When running `cortex cluster down`, a lifecycle rule is applied to the bucket such that its entire contents are removed within the next 24 hours. You can safely delete the bucket at any time after `cortex cluster down` has finished running.
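For the bucket cleanup described above, a sketch of deleting the bucket immediately instead of waiting for the lifecycle rule might look like this (the bucket name and region are placeholders; look up the actual bucket name, e.g. in the S3 console):

```bash
# empty and remove the cluster's internal bucket (name and region are examples)
aws s3 rb s3://cortex-example-cluster-bucket --force --region us-west-2
```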

-## Delete Certificates
+## Delete SSL Certificate

-If you've configured a custom domain for your APIs, you can remove the SSL Certificate and Hosted Zone for the domain by
-following these [instructions](../networking/custom-domain.md#cleanup).
+If you've set up HTTPS, you can remove the SSL Certificate by following these [instructions](../networking/https.md#cleanup).
+
+## Delete Hosted Zone
+
+If you've configured a custom domain for your APIs, follow these [instructions](../networking/custom-domain.md#cleanup) to delete the Hosted Zone.

 ## Keep Cortex Resources

+89 -0

@@ -0,0 +1,89 @@
+# Production guide
+
+As you take Cortex from development to production, here are a few pointers that might be useful.
+
+## Use images from a colocated ECR
+
+Configure your cluster and APIs to use images from ECR in the same region as your cluster to accelerate scale-ups, reduce ingress costs, and remove the dependency on Cortex's public quay.io registry.
+
+You can find instructions for mirroring Cortex images [here](../advanced/self-hosted-images.md).
+
+## Handling Cortex updates/upgrades
+
+Use a Route 53 hosted zone as a proxy in front of your Cortex cluster. Every new Cortex cluster provisions a new API load balancer with a unique endpoint. Using a Route 53 hosted zone configured with a subdomain will expose your Cortex cluster's API endpoint as a static endpoint (e.g. `cortex.your-company.com`). You will be able to upgrade Cortex versions without downtime, and you will avoid the need to update your client code every time you migrate to a new cluster. You can find instructions for setting up a custom domain with a Route 53 hosted zone [here](../networking/custom-domain.md), and instructions for updating/upgrading your cluster [here](update.md).
+
+## Production cluster configuration
+
+### Securing your cluster
+
+The following configuration will improve security by preventing your cluster's nodes from being publicly accessible.
+
+```yaml
+subnet_visibility: private
+
+nat_gateway: single # use "highly_available" for large clusters making requests to services outside of the cluster
+```
+
+You can make your load balancer private to prevent your APIs from being publicly accessed. In order to access your APIs, you will need to set up VPC peering between the Cortex cluster's VPC and the VPC containing the consumers of the Cortex APIs. See the [VPC peering guide](../networking/vpc-peering.md) for more details.
+
+```yaml
+api_load_balancer_scheme: internal
+```
+
+You can also restrict access to your load balancers by IP address:
+
+```yaml
+api_load_balancer_cidr_white_list: [0.0.0.0/0]
+```
+
+These two fields are also available for the operator load balancer. Keep in mind that if you make the operator load balancer private, you'll need to configure VPC peering to use the `cortex` CLI or Python client.
+
+```yaml
+operator_load_balancer_scheme: internal
+operator_load_balancer_cidr_white_list: [0.0.0.0/0]
+```
+
+See [here](../networking/load-balancers.md) for more information about the load balancers.
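To actually restrict access, the whitelist fields above accept a list of CIDR blocks; for example (the address ranges are placeholders):

```yaml
api_load_balancer_cidr_white_list: [203.0.113.0/24, 198.51.100.7/32]
operator_load_balancer_cidr_white_list: [203.0.113.0/24]
```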
+
+### Ensure node provisioning
+
+You can take advantage of the cost savings of spot instances and the reliability of on-demand instances by utilizing the `priority` field in node groups. You can deploy two node groups, one that is spot and another that is on-demand. Set the priority of the spot node group to be higher than the priority of the on-demand node group. This encourages the cluster-autoscaler to try to spin up instances from the spot node group first. If there are no more spot instances available, the on-demand node group will be used instead.
+
+```yaml
+node_groups:
+  - name: gpu-spot
+    instance_type: g4dn.xlarge
+    min_instances: 0
+    max_instances: 5
+    spot: true
+    priority: 100
+  - name: gpu-on-demand
+    instance_type: g4dn.xlarge
+    min_instances: 0
+    max_instances: 5
+    priority: 1
+```
+
+### Considerations for large clusters
+
+If you plan on scaling your Cortex cluster past 400 nodes or 800 pods, it is recommended to set `prometheus_instance_type` to a larger instance type. A good guideline is that a t3.medium instance can reliably handle 400 nodes and 800 pods.
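For example, the Prometheus instance size is set in the cluster configuration file; the instance type below is an illustrative choice rather than an official sizing recommendation:

```yaml
# cluster.yaml (excerpt)
prometheus_instance_type: t3.large
```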
+
+## API Spec
+
+### Container design
+
+Configure your health checks to be as accurate as possible to prevent requests from being routed to pods that aren't ready to handle traffic.
+
+### Pods section
+
+Make sure that `max_concurrency` is set to match the concurrency supported by your container.
+
+Tune `max_queue_length` to a lower value if you would like to redistribute requests to newer pods more aggressively as your API scales up, rather than allowing requests to linger in queues. This means that the clients consuming your APIs should implement retry logic with a delay (such as exponential backoff).
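A minimal sketch of how these fields might appear in a Realtime API spec is shown below; the surrounding structure and values are assumptions, so check the API configuration reference for the exact schema:

```yaml
# api.yaml (excerpt, illustrative)
- name: my-api                # hypothetical API name
  kind: RealtimeAPI
  pod:
    max_concurrency: 1        # match the concurrency your container supports
    max_queue_length: 16      # shorter queues redistribute requests to new pods sooner
    containers:
      - name: api
        image: <your_image>
        # define a readiness probe here (see the container configuration reference)
```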
+
+### Compute section
+
+Make sure to specify all of the relevant compute resources (especially cpu and memory) to ensure that your pods aren't starved for resources.
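Continuing the sketch above, compute requests are specified per container; the field names and values below are illustrative and should be checked against the API configuration reference:

```yaml
# api.yaml (excerpt, illustrative)
containers:
  - name: api
    image: <your_image>
    compute:
      cpu: 1      # example request
      mem: 2Gi    # example request
      gpu: 1      # only if the workload needs a GPU
```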
+
+### Autoscaling
+
+Revisit the autoscaling docs for [Realtime APIs](../../workloads/realtime/autoscaling.md) and/or [Async APIs](../../workloads/async/autoscaling.md) to effectively handle production traffic by tuning the scaling rate, sensitivity, and over-provisioning.

docs/clusters/management/update.md

+98 -20

@@ -1,36 +1,114 @@
 # Update

-## Update node group size
+## Modify existing cluster
+
+You can add or remove node groups, resize existing node groups, and update some configuration fields of a running cluster.
+
+Fetch the current cluster configuration:

 ```bash
-cortex cluster scale --node-group <node-group-name> --min-instances <min-instances> --max-instances <max-instances>
+cortex cluster info --print-config --name CLUSTER_NAME --region REGION > cluster.yaml
 ```

-## Upgrade to a newer version
+Make your desired changes, and then apply them:

 ```bash
-# spin down your cluster
-cortex cluster down --name <name> --region <region>
+cortex cluster configure cluster.yaml
+```
+
+Cortex will calculate the difference and prompt you with the update plan.
+
+If you would like to update fields that cannot be modified on a running cluster, you must create a new cluster with your desired configuration.
+
+## Upgrade to a new version
+
+Upgrading an existing Cortex cluster in place is not supported at the moment. Please spin down the previous version of the cluster, install the latest version of the Cortex CLI, and use it to spin up a new Cortex cluster. See the next section for how to do this without downtime.
+
+## Update or upgrade without downtime
+
+It is possible to update to a new version of Cortex or to migrate from one cluster to another without downtime.
+
+Note: it is important not to spin down your previous cluster until after your new cluster is receiving traffic.
+
+### Set up a subdomain using a Route 53 hosted zone
+
+If you've already set up a subdomain with a Route 53 hosted zone pointing to your cluster, skip this step.
+
+Setting up a Route 53 hosted zone allows you to transfer traffic seamlessly from an existing cluster to a new cluster, thereby avoiding downtime. You can find the instructions for setting up a subdomain [here](../networking/custom-domain.md). You will need to update any clients interacting with your Cortex APIs to point to the new subdomain.

-# update your CLI to the latest version
-pip install --upgrade cortex
+### Export all APIs from your previous cluster

-# confirm version
+The `cluster export` command can be used to get the YAML specifications of all APIs deployed in your cluster:
+
+```bash
+cortex cluster export --name <previous_cluster_name> --region <region>
+```
+
+### Spin up a new Cortex cluster
+
+If you are creating a new cluster with the same Cortex version:
+
+```bash
+cortex cluster up new-cluster.yaml --configure-env cortex2
+```
+
+This will create a CLI environment named `cortex2` for accessing the new cluster.
+
+If you are spinning up a new cluster with a different Cortex version, first install the Cortex CLI matching the desired cluster version:
+
+```bash
+# download the desired CLI version, replace 0.38.0 with the desired version (Note the "v"):
+bash -c "$(curl -sS https://github.com/raw/cortexlabs/cortex/v0.38.0/get-cli.sh)"
+
+# confirm Cortex CLI version
 cortex version

-# spin up your cluster
-cortex cluster up cluster.yaml
+# spin up your cluster using the new CLI version
+cortex cluster up cluster.yaml --configure-env cortex2
+```
+
+You can use different Cortex CLIs to interact with the different versioned clusters; here is an example:
+
+```bash
+# download the desired CLI version, replace 0.38.0 with the desired version (Note the "v"):
+CORTEX_INSTALL_PATH=$(pwd)/cortex0.38.0 bash -c "$(curl -sS https://github.com/raw/cortexlabs/cortex/v0.38.0/get-cli.sh)"
+
+# confirm cortex CLI version
+./cortex0.38.0 version
+```
+
+### Deploy the APIs to your new cluster
+
+Please read the [changelogs](https://github.com/cortexlabs/cortex/releases) and the latest documentation to identify any features and breaking changes in the new version. You may need to make modifications to your cluster and/or API configuration files.
+
+After you've updated the API specifications and images as necessary, deploy them onto your new cluster:
+
+```bash
+cortex deploy -e cortex2 <api_spec_file>
+```
+
+### Point your custom domain to your new cluster
+
+Verify that all of the APIs in your new cluster are working as expected by accessing them via the cluster's API load balancer URL.
+
+Get the cluster's API load balancer URL:
+
+```bash
+cortex cluster info --name <new_cluster_name> --region <region>
 ```

-## Upgrade without downtime
+Once the APIs on the new cluster have been verified as working properly, it is recommended to update `min_replicas` of your APIs on the new cluster to match the current values in your previous cluster. This will avoid large sudden scale-up events as traffic is shifted to the new cluster.

-In production environments, you can upgrade your cluster without downtime if you have a backend service or DNS in front of your Cortex cluster:
+Then, navigate to the A record in your custom domain's Route 53 hosted zone and update the Alias to point to the new cluster's API load balancer URL. Rather than suddenly routing all of your traffic from the previous cluster to the new cluster, you can use [weighted records](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/routing-policy.html#routing-policy-weighted) to incrementally route more traffic to your new cluster.

-1. Spin up a new cluster. For example: `cortex cluster up new-cluster.yaml --configure-env cortex2` (this will create a CLI environment named `cortex2` for accessing the new cluster).
-1. Re-deploy your APIs in your new cluster. For example, if the name of your CLI environment for your existing cluster is `cortex`, you can use `cortex get --env cortex` to list all running APIs in your cluster, and re-deploy them in the new cluster by running `cortex deploy --env cortex2` for each API. Alternatively, you can run `cortex cluster export --name <previous_cluster_name> --region <region>` to export the API specifications for all of your running APIs, change directories the folder that was exported, and run `cortex deploy --env cortex2 <file_name>` for each API that you want to deploy in the new cluster.
-1. Route requests to your new cluster.
-    * If you are using a custom domain: update the A record in your Route 53 hosted zone to point to your new cluster's API load balancer.
-    * If you have a backend service which makes requests to Cortex: update your backend service to make requests to the new cluster's endpoints.
-    * If you have a self-managed API Gateway in front of your Cortex cluster: update the routes to use new cluster's endpoints.
-1. Spin down your previous cluster. If you updated DNS settings, wait 24-48 hours before spinning down your previous cluster to allow the DNS cache to be flushed.
-1. You may now rename your new CLI environment name if you'd like (e.g. to rename it back to "cortex": `cortex env rename cortex2 cortex`)
+If you increased `min_replicas` for your APIs in the new cluster during the transition, you may reduce `min_replicas` back to your desired level once all traffic has been shifted.
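As an illustration of the `min_replicas` adjustment described above, if an API is currently running 10 replicas on the previous cluster, its spec on the new cluster could temporarily pin at least that many (field placement per the autoscaling documentation; the numbers are examples):

```yaml
# api.yaml (excerpt, illustrative)
autoscaling:
  min_replicas: 10   # match the replica count observed on the previous cluster
  max_replicas: 20
```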
+
+### Spin down the previous cluster
+
+After confirming that your previous cluster has completed servicing all existing traffic and is not receiving any new traffic, spin down your previous cluster:
+
+```bash
+# Note: it is recommended to install the Cortex CLI matching the previous cluster's version to ensure proper deletion.
+
+cortex cluster down --name <previous_cluster_name> --region <region>
+```
