* Add concurrency to dequeuer
* Fix dequeuer test
* Address golangci-lint issues
* Add max concurrency validation for async api kinds
* Pass-in argument for number of workers to dequeuer sidecar
* Add max concurrency to async configuration.md docs
* Return nil when worker shuts down
* Update default value for target_in_flight and docs
Co-authored-by: Robert Lucian Chiriac <[email protected]>
Co-authored-by: David Eliahu <[email protected]>
**`docs/workloads/async/autoscaling.md`** (+12 −4)
@@ -4,6 +4,14 @@ Cortex auto-scales AsyncAPIs on a per-API basis based on your configuration.

## Autoscaling replicas

+### Relevant pod configuration
+
+In addition to the autoscaling configuration options (described below), there is one field in the pod configuration which is relevant to replica autoscaling:
+
+**`max_concurrency`** (default: 1): The maximum number of requests that will be concurrently sent into the container by Cortex. If your web server is designed to handle multiple concurrent requests, increasing `max_concurrency` will increase the throughput of a replica (and result in fewer total replicas for a given load).
+
+<br>
+
### Autoscaling configuration

**`min_replicas`** (default: 1): The lower bound on how many replicas can be running for an API. Scale-to-zero is supported.

@@ -14,13 +22,13 @@ Cortex auto-scales AsyncAPIs on a per-API basis based on your configuration.

<br>

-**`target_in_flight`** (default: 1): This is the desired number of in-flight requests per replica, and is the metric which the autoscaler uses to make scaling decisions. The number of in-flight requests is simply how many requests have been submitted and are not yet finished being processed. Therefore, this number includes requests which are actively being processed as well as requests which are waiting in the queue.
+**`target_in_flight`** (default: `max_concurrency` in the pod configuration): This is the desired number of in-flight requests per replica, and is the metric which the autoscaler uses to make scaling decisions. The number of in-flight requests is simply how many requests have been submitted and are not yet finished being processed. Therefore, this number includes requests which are actively being processed as well as requests which are waiting in the queue.

The autoscaler uses this formula to determine the number of desired replicas:

`desired replicas = total in-flight requests / target_in_flight`

-For example, setting `target_in_flight` to 1 (the default) causes the cluster to adjust the number of replicas so that on average, there are no requests waiting in the queue.
+For example, setting `target_in_flight` to `max_concurrency` (the default) causes the cluster to adjust the number of replicas so that on average, there are no requests waiting in the queue.

<br>

@@ -58,9 +66,9 @@ Cortex spins up and down instances based on the aggregate resource requests of a

## Overprovisioning

-The default value for `target_in_flight` is 1, which behaves well in many situations (see above for an explanation of how `target_in_flight` affects autoscaling). However, if your application is sensitive to spikes in traffic or if creating new replicas takes too long (see below), you may find it helpful to maintain extra capacity to handle the increased traffic while new replicas are being created. This can be accomplished by setting `target_in_flight` to a lower value. The smaller `target_in_flight` is, the more unused capacity your API will have, and the more room it will have to handle sudden increased load. The increased request rate will still trigger the autoscaler, and your API will stabilize again (maintaining the overprovisioned capacity).
+The default value for `target_in_flight` is `max_concurrency`, which behaves well in many situations (see above for an explanation of how `target_in_flight` affects autoscaling). However, if your application is sensitive to spikes in traffic or if creating new replicas takes too long (see below), you may find it helpful to maintain extra capacity to handle the increased traffic while new replicas are being created. This can be accomplished by setting `target_in_flight` to a value lower than the replica's expected concurrency. The smaller `target_in_flight` is, the more unused capacity your API will have, and the more room it will have to handle sudden increased load. The increased request rate will still trigger the autoscaler, and your API will stabilize again (maintaining the overprovisioned capacity).

-For example, if you wanted to overprovision by 25%, you could set `target_in_flight` to 0.8. If your API has an average of 4 concurrent requests, the autoscaler would maintain 5 live replicas (4/0.8 = 5).
+For example, if you've determined that each replica in your API can efficiently handle 2 concurrent requests, you would typically set `target_in_flight` to 2. In a scenario where your API is receiving 8 concurrent requests on average, the autoscaler would maintain 4 live replicas (8/2 = 4). If you wanted to overprovision by 25%, you could set `target_in_flight` to 1.6, causing the autoscaler to maintain 5 live replicas (8/1.6 = 5).
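The scaling formula and the overprovisioning arithmetic above can be sanity-checked with a short sketch; the `desiredReplicas` helper below is ours, not part of Cortex, and simply applies `desired replicas = total in-flight requests / target_in_flight` with rounding and min/max clamping:

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the autoscaler formula from the docs:
// desired replicas = total in-flight requests / target_in_flight,
// rounded up and clamped to [minReplicas, maxReplicas].
// Hypothetical helper illustrating only the arithmetic described above.
func desiredReplicas(totalInFlight, targetInFlight float64, minReplicas, maxReplicas int) int {
	desired := int(math.Ceil(totalInFlight / targetInFlight))
	if desired < minReplicas {
		return minReplicas
	}
	if desired > maxReplicas {
		return maxReplicas
	}
	return desired
}

func main() {
	// 8 concurrent requests with target_in_flight = max_concurrency = 2:
	fmt.Println(desiredReplicas(8, 2, 1, 100)) // 4 replicas
	// Overprovision by 25% by lowering target_in_flight to 1.6:
	fmt.Println(desiredReplicas(8, 1.6, 1, 100)) // 5 replicas
}
```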
**`docs/workloads/async/configuration.md`** (+2 −1)
@@ -5,6 +5,7 @@

  kind: AsyncAPI  # must be "AsyncAPI" for async APIs (required)
  pod:  # pod configuration (required)
    port: <int>  # port to which requests will be sent (default: 8080; exported as $CORTEX_PORT)
+   max_concurrency: <int>  # maximum number of requests that will be concurrently sent into the container (default: 1, max allowed: 100)
    containers:  # configurations for the containers to run (at least one container must be provided)
      - name: <string>  # name of the container (required)
        image: <string>  # docker image to use for the container (required)

@@ -45,7 +46,7 @@

    min_replicas: <int>  # minimum number of replicas (default: 1; min value: 0)
    max_replicas: <int>  # maximum number of replicas (default: 100)
    init_replicas: <int>  # initial number of replicas (default: <min_replicas>)
-   target_in_flight: <float>  # desired number of in-flight requests per replica (including requests actively being processed as well as queued), which the autoscaler tries to maintain (default: 1)
+   target_in_flight: <float>  # desired number of in-flight requests per replica (including requests actively being processed as well as queued), which the autoscaler tries to maintain (default: <max_concurrency>)
    window: <duration>  # duration over which to average the API's in-flight requests per replica (default: 60s)
    downscale_stabilization_period: <duration>  # the API will not scale below the highest recommendation made during this period (default: 5m)
    upscale_stabilization_period: <duration>  # the API will not scale above the lowest recommendation made during this period (default: 1m)
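The defaulting and validation rules introduced by this configuration change (`max_concurrency` between 1 and 100, `target_in_flight` defaulting to `max_concurrency`) could be sketched as follows; the struct and function names are hypothetical, not Cortex's actual spec-validation code:

```go
package main

import (
	"errors"
	"fmt"
)

const maxAllowedConcurrency = 100 // "max allowed: 100" from the configuration above

// AsyncSpec mirrors the two fields relevant to this change; names are ours.
type AsyncSpec struct {
	MaxConcurrency int
	TargetInFlight float64
}

// validate applies the documented defaults and bounds: max_concurrency
// defaults to 1 and must stay within [1, 100]; target_in_flight defaults
// to max_concurrency when unset.
func (s *AsyncSpec) validate() error {
	if s.MaxConcurrency == 0 {
		s.MaxConcurrency = 1
	}
	if s.MaxConcurrency < 1 || s.MaxConcurrency > maxAllowedConcurrency {
		return errors.New("max_concurrency must be between 1 and 100")
	}
	if s.TargetInFlight == 0 {
		s.TargetInFlight = float64(s.MaxConcurrency)
	}
	return nil
}

func main() {
	spec := AsyncSpec{MaxConcurrency: 2}
	if err := spec.validate(); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(spec.TargetInFlight) // target_in_flight defaulted to max_concurrency (2)
}
```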
**`docs/workloads/realtime/autoscaling.md`** (+1 −1)
@@ -72,7 +72,7 @@ Cortex spins up and down instances based on the aggregate resource requests of a

The default value for `target_in_flight` is `max_concurrency`, which behaves well in many situations (see above for an explanation of how `target_in_flight` affects autoscaling). However, if your application is sensitive to spikes in traffic or if creating new replicas takes too long (see below), you may find it helpful to maintain extra capacity to handle the increased traffic while new replicas are being created. This can be accomplished by setting `target_in_flight` to a value lower than the replica's expected concurrency. The smaller `target_in_flight` is, the more unused capacity your API will have, and the more room it will have to handle sudden increased load. The increased request rate will still trigger the autoscaler, and your API will stabilize again (maintaining the overprovisioned capacity).

-For example, if you've determined that each replica in your API can handle 2 concurrent requests, you would typically set `target_in_flight` to 2. In a scenario where your API is receiving 8 concurrent requests on average, the autoscaler would maintain 4 live replicas (8/2 = 4). If you wanted to overprovision by 25%, you could set `target_in_flight` to 1.6, causing the autoscaler to maintain 5 live replicas (8/1.6 = 5).
+For example, if you've determined that each replica in your API can efficiently handle 2 concurrent requests, you would typically set `target_in_flight` to 2. In a scenario where your API is receiving 8 concurrent requests on average, the autoscaler would maintain 4 live replicas (8/2 = 4). If you wanted to overprovision by 25%, you could set `target_in_flight` to 1.6, causing the autoscaler to maintain 5 live replicas (8/1.6 = 5).