
Removed ingesters blocks transfer support #2996


Merged
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,11 @@

## master / unreleased

* [CHANGE] Experimental blocks storage: removed support for transferring blocks between ingesters on shutdown. When running the Cortex blocks storage, ingesters are expected to run with a persistent disk. The following metrics have been removed: #2996
* `cortex_ingester_sent_files`
* `cortex_ingester_received_files`
* `cortex_ingester_received_bytes_total`
* `cortex_ingester_sent_bytes_total`
* [ENHANCEMENT] Query-tee: added a small tolerance to floating point sample values comparison. #2994

## 1.3.0 in progress
@@ -13,8 +13,6 @@ ingester_client:
use_gzip_compression: true

ingester:
max_transfer_retries: 1

lifecycler:
# We want to start immediately.
join_after: 0
2 changes: 0 additions & 2 deletions development/tsdb-blocks-storage-s3/config/cortex.yaml
@@ -13,8 +13,6 @@ ingester_client:
use_gzip_compression: true

ingester:
max_transfer_retries: 1

lifecycler:
# We want to start immediately.
join_after: 0
23 changes: 12 additions & 11 deletions docs/architecture.md
@@ -153,17 +153,18 @@ Incoming series are not immediately written to the storage but kept in memory an

Ingesters contain a **lifecycler** which manages the lifecycle of an ingester and stores the **ingester state** in the [hash ring](#the-hash-ring). Each ingester could be in one of the following states:

1. `PENDING` is an ingester's state when it just started and is waiting for a hand-over from another ingester that is `LEAVING`. If no hand-over occurs within the configured timeout period ("auto-join timeout", configurable via `-ingester.join-after` option), the ingester will join the ring with a new set of random tokens (ie. during a scale up). When hand-over process starts, state changes to `JOINING`.

2. `JOINING` is an ingester's state in two situations. First, ingester will switch to a `JOINING` state from `PENDING` state after auto-join timeout. In this case, ingester will generate tokens, store them into the ring, optionally observe the ring for token conflicts and then move to `ACTIVE` state. Second, ingester will also switch into a `JOINING` state as a result of another `LEAVING` ingester initiating a hand-over process with `PENDING` (which then switches to `JOINING` state). `JOINING` ingester then receives series and tokens from `LEAVING` ingester, and if everything goes well, `JOINING` ingester switches to `ACTIVE` state. If hand-over process fails, `JOINING` ingester will move back to `PENDING` state and either wait for another hand-over or auto-join timeout.

3. `ACTIVE` is an ingester's state when it is fully initialized. It may receive both write and read requests for tokens it owns.

4. `LEAVING` is an ingester's state when it is shutting down. It cannot receive write requests anymore, while it could still receive read requests for series it has in memory. While in this state, the ingester may look for a `PENDING` ingester to start a hand-over process with, used to transfer the state from `LEAVING` ingester to the `PENDING` one, during a rolling update (`PENDING` ingester moves to `JOINING` state during hand-over process). If there is no new ingester to accept hand-over, ingester in `LEAVING` state will flush data to storage instead.

5. `UNHEALTHY` is an ingester's state when it has failed to heartbeat to the ring's KV Store. While in this state, distributors skip the ingester while building the replication set for incoming series and the ingester does not receive write or read requests.

For more information about the hand-over process, please check out the [Ingester hand-over](guides/ingester-handover.md) documentation.
- **`PENDING`**<br />
The ingester has just started. While in this state, the ingester doesn't receive either write or read requests, and it could be waiting for a time series data transfer from another ingester if running the chunks storage and the [hand-over](guides/ingesters-rolling-updates.md#chunks-storage-with-wal-disabled-hand-over) is enabled.
- **`JOINING`**<br />
The ingester is starting up and joining the ring. While in this state the ingester doesn't receive either write or read requests. The ingester will join the ring using tokens received from a leaving ingester as part of the [hand-over](guides/ingesters-rolling-updates.md#chunks-storage-with-wal-disabled-hand-over) process (if enabled); otherwise it could load tokens from disk (if `-ingester.tokens-file-path` is configured) or generate a set of new random ones. Finally, the ingester optionally observes the ring for token conflicts and then, once any conflict is resolved, moves to the `ACTIVE` state.
- **`ACTIVE`**<br />
The ingester is up and running. While in this state the ingester can receive both write and read requests.
- **`LEAVING`**<br />
The ingester is shutting down and leaving the ring. While in this state the ingester doesn't receive write requests, but it can still receive read requests.
- **`UNHEALTHY`**<br />
The ingester has failed to heartbeat to the ring's KV Store. While in this state, distributors skip the ingester while building the replication set for incoming series and the ingester does not receive write or read requests.

_The ingester states are internally used for different purposes, including the series hand-over process supported by the chunks storage. For more information about it, please check out the [Ingester hand-over](guides/ingesters-rolling-updates.md#chunks-storage-with-wal-disabled-hand-over) documentation._
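The `-ingester.tokens-file-path` option mentioned above can also be set in the lifecycler block of the configuration file. A minimal sketch (the path below is an illustrative assumption):

```yaml
ingester:
  lifecycler:
    # Persist this ingester's ring tokens so they can be reloaded on
    # restart instead of generating a new random set.
    # (The path is just an example.)
    tokens_file_path: /data/tokens
```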

Ingesters are **semi-stateful**.

5 changes: 2 additions & 3 deletions docs/blocks-storage/querier.md
@@ -475,9 +475,8 @@ blocks_storage:
# CLI flag: -experimental.blocks-storage.tsdb.wal-compression-enabled
[wal_compression_enabled: <boolean> | default = false]
# If true, and transfer of blocks on shutdown fails or is disabled,
# incomplete blocks are flushed to storage instead. If false, incomplete
# blocks will be reused after restart, and uploaded when finished.
# True to flush blocks to storage on shutdown. If false, incomplete blocks
# will be reused after restart.
# CLI flag: -experimental.blocks-storage.tsdb.flush-blocks-on-shutdown
[flush_blocks_on_shutdown: <boolean> | default = false]
5 changes: 2 additions & 3 deletions docs/blocks-storage/store-gateway.md
@@ -502,9 +502,8 @@ blocks_storage:
# CLI flag: -experimental.blocks-storage.tsdb.wal-compression-enabled
[wal_compression_enabled: <boolean> | default = false]
# If true, and transfer of blocks on shutdown fails or is disabled,
# incomplete blocks are flushed to storage instead. If false, incomplete
# blocks will be reused after restart, and uploaded when finished.
# True to flush blocks to storage on shutdown. If false, incomplete blocks
# will be reused after restart.
# CLI flag: -experimental.blocks-storage.tsdb.flush-blocks-on-shutdown
[flush_blocks_on_shutdown: <boolean> | default = false]
4 changes: 2 additions & 2 deletions docs/configuration/arguments.md
@@ -306,11 +306,11 @@ It also talks to a KVStore and has its own copies of the same flags used by the

- `-ingester.join-after`

How long to wait in PENDING state during the [hand-over process](../guides/ingester-handover.md). (default 0s)
How long to wait in PENDING state during the [hand-over process](../guides/ingesters-rolling-updates.md#chunks-storage-with-wal-disabled-hand-over) (supported only by the chunks storage). (default 0s)

- `-ingester.max-transfer-retries`

How many times a LEAVING ingester tries to find a PENDING ingester during the [hand-over process](../guides/ingester-handover.md). Each attempt takes a second or so. Negative value or zero disables hand-over process completely. (default 10)
How many times a LEAVING ingester tries to find a PENDING ingester during the [hand-over process](../guides/ingesters-rolling-updates.md#chunks-storage-with-wal-disabled-hand-over) (supported only by the chunks storage). Negative value or zero disables hand-over process completely. (default 10)

- `-ingester.normalise-tokens`

8 changes: 4 additions & 4 deletions docs/configuration/config-file-reference.md
@@ -543,7 +543,8 @@ lifecycler:
[availability_zone: <string> | default = ""]
# Number of times to try and transfer chunks before falling back to flushing.
# Negative value or zero disables hand-over.
# Negative value or zero disables hand-over. This feature is supported only by
# the chunks storage.
# CLI flag: -ingester.max-transfer-retries
[max_transfer_retries: <int> | default = 10]
@@ -3263,9 +3264,8 @@ tsdb:
# CLI flag: -experimental.blocks-storage.tsdb.wal-compression-enabled
[wal_compression_enabled: <boolean> | default = false]
# If true, and transfer of blocks on shutdown fails or is disabled, incomplete
# blocks are flushed to storage instead. If false, incomplete blocks will be
# reused after restart, and uploaded when finished.
# True to flush blocks to storage on shutdown. If false, incomplete blocks
# will be reused after restart.
# CLI flag: -experimental.blocks-storage.tsdb.flush-blocks-on-shutdown
[flush_blocks_on_shutdown: <boolean> | default = false]
3 changes: 0 additions & 3 deletions docs/configuration/single-process-config-blocks-gossip-1.yaml
@@ -29,9 +29,6 @@ ingester_client:
use_gzip_compression: true

ingester:
# Disable blocks transfers on ingesters shutdown or rollout.
max_transfer_retries: 0

lifecycler:
# The address to advertise for this ingester. Will be autodiscovered by
# looking up address on eth0 or en0; can be specified if this fails.
3 changes: 0 additions & 3 deletions docs/configuration/single-process-config-blocks-gossip-2.yaml
@@ -29,9 +29,6 @@ ingester_client:
use_gzip_compression: true

ingester:
# Disable blocks transfers on ingesters shutdown or rollout.
max_transfer_retries: 0

lifecycler:
# The address to advertise for this ingester. Will be autodiscovered by
# looking up address on eth0 or en0; can be specified if this fails.
3 changes: 0 additions & 3 deletions docs/configuration/single-process-config-blocks-tls.yaml
@@ -37,9 +37,6 @@ ingester_client:
tls_ca_path: "root.crt"

ingester:
# Disable blocks transfers on ingesters shutdown or rollout.
max_transfer_retries: 0

lifecycler:
# The address to advertise for this ingester. Will be autodiscovered by
# looking up address on eth0 or en0; can be specified if this fails.
3 changes: 0 additions & 3 deletions docs/configuration/single-process-config-blocks.yaml
@@ -28,9 +28,6 @@ ingester_client:
use_gzip_compression: true

ingester:
# Disable blocks transfers on ingesters shutdown or rollout.
max_transfer_retries: 0

lifecycler:
# The address to advertise for this ingester. Will be autodiscovered by
# looking up address on eth0 or en0; can be specified if this fails.
49 changes: 0 additions & 49 deletions docs/guides/ingester-handover.md

This file was deleted.

90 changes: 90 additions & 0 deletions docs/guides/ingesters-rolling-updates.md
@@ -0,0 +1,90 @@
---
title: "Ingesters rolling updates"
linkTitle: "Ingesters rolling updates"
weight: 102
slug: ingesters-rolling-updates
---

Cortex [ingesters](architecture.md#ingester) are semi-stateful.
A running ingester holds several hours of time series data in memory, before it is flushed to the long-term storage.
When an ingester shuts down, because of a rolling update or maintenance, the in-memory data must not be discarded, in order to avoid any data loss.

In this document we describe the techniques employed to safely handle rolling updates, based on different setups:

- [Blocks storage](#blocks-storage)
- [Chunks storage with WAL enabled](#chunks-storage-with-wal-enabled)
- [Chunks storage with WAL disabled](#chunks-storage-with-wal-disabled-hand-over)

## Blocks storage

The Cortex [blocks storage](../blocks-storage/_index.md) requires ingesters to run with a persistent disk where the TSDB WAL and blocks are stored (eg. a StatefulSet when deployed on Kubernetes).

During a rolling update, the leaving ingester closes the open TSDBs, synchronizes the data to disk (`fsync`) and releases the disk resources.
The new ingester, which is expected to reuse the same disk as the leaving one, will replay the TSDB WAL on startup in order to load back into memory the time series that have not been compacted into a block yet.

_The blocks storage doesn't support the series [hand-over](#chunks-storage-with-wal-disabled-hand-over)._
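As a sketch of the persistent disk setup described above, a Kubernetes deployment typically runs ingesters as a StatefulSet with a volume claim template, so that a restarted pod reattaches the same volume holding the TSDB WAL and blocks. All names, image and sizes below are illustrative assumptions, not a definitive manifest:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ingester
spec:
  serviceName: ingester
  replicas: 3
  selector:
    matchLabels:
      name: ingester
  template:
    metadata:
      labels:
        name: ingester
    spec:
      containers:
        - name: ingester
          image: cortexproject/cortex:latest  # example image tag
          args: ["-target=ingester"]
          volumeMounts:
            - name: ingester-data
              mountPath: /data  # TSDB WAL and blocks live here
  volumeClaimTemplates:
    - metadata:
        name: ingester-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi  # size depends on retention and series churn
```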

## Chunks storage

The Cortex chunks storage optionally supports a write-ahead log (WAL).
The rolling update procedure for a Cortex cluster running the chunks storage depends on whether the WAL is enabled or not.

### Chunks storage with WAL enabled

Similarly to the blocks storage, when Cortex is running the chunks storage with WAL enabled, it requires ingesters to run with a persistent disk where the WAL is stored (eg. a StatefulSet when deployed on Kubernetes).

During a rolling update, the leaving ingester closes the WAL, synchronizes the data to disk (`fsync`) and releases the disk resources.
The new ingester, which is expected to reuse the same disk as the leaving one, will replay the WAL on startup in order to load back into memory the time series data.

_For more information about the WAL, please refer to [Ingesters with WAL](../production/ingesters-with-wal.md)._
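For reference, the chunks WAL is controlled by the `-ingester.wal-enabled` and `-ingester.wal-dir` flags; a minimal configuration sketch follows (the directory is an illustrative assumption, and must live on the persistent volume):

```yaml
ingester:
  walconfig:
    # Write incoming samples to a write-ahead log before acknowledging.
    wal_enabled: true
    # Periodic checkpoints shorten WAL replay time on startup.
    checkpoint_enabled: true
    # Must point at the persistent volume, so the replacement ingester
    # can replay it. (The path is just an example.)
    wal_dir: /data/wal
```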

### Chunks storage with WAL disabled (hand-over)

When Cortex is running the chunks storage with WAL disabled, Cortex supports on-the-fly series hand-over between a leaving ingester and a joining one.

The hand-over is based on the ingester states stored in the ring. Each ingester could be in one of the following **states**:

- `PENDING`
- `JOINING`
- `ACTIVE`
- `LEAVING`

On startup, an ingester goes into the **`PENDING`** state.
In this state, the ingester is waiting for a hand-over from another ingester that is `LEAVING`.
If no hand-over occurs within the configured timeout period ("auto-join timeout", configurable via `-ingester.join-after` option), the ingester will join the ring with a new set of random tokens (eg. during a scale up) and will switch its state to `ACTIVE`.
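The auto-join timeout corresponds to the `join_after` setting in the lifecycler block of the configuration file; a sketch with an illustrative 30-second wait:

```yaml
ingester:
  lifecycler:
    # Wait up to 30s in PENDING for a hand-over from a LEAVING ingester,
    # then generate random tokens and switch to ACTIVE.
    join_after: 30s
```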

When a running ingester in the **`ACTIVE`** state is notified to shutdown via `SIGINT` or `SIGTERM` Unix signal, the ingester switches to `LEAVING` state. In this state it cannot receive write requests anymore, but it can still receive read requests for series it has in memory.

A **`LEAVING`** ingester looks for a `PENDING` ingester to start a hand-over process with.
If it finds one, that ingester goes into the `JOINING` state and the leaver transfers all its in-memory data over to the joiner.
On successful transfer the leaver removes itself from the ring and exits, while the joiner changes its state to `ACTIVE`, taking over ownership of the leaver's [ring tokens](../architecture.md#hashing). As soon as the joiner switches its state to `ACTIVE`, it will start receiving both write requests from distributors and queries from queriers.

If the `LEAVING` ingester does not find a `PENDING` ingester after `-ingester.max-transfer-retries` retries, it will flush all of its chunks to the long-term storage, then remove itself from the ring and exit. Flushing the chunks to the storage may take several minutes to complete.
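The retries can be tuned, or the hand-over disabled entirely, through the `max_transfer_retries` setting (matching the `-ingester.max-transfer-retries` flag); for example:

```yaml
ingester:
  # Try up to 10 times to find a PENDING ingester before falling back
  # to flushing chunks to the long-term storage.
  # A zero or negative value disables the hand-over completely.
  max_transfer_retries: 10
```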

#### Higher number of series / chunks during rolling updates

During hand-over, neither the leaving nor joining ingesters will
accept new samples. Distributors are aware of this, and "spill" the
samples to the next ingester in the ring. This creates a set of extra
"spilled" series and chunks which will idle out and flush after hand-over is
complete.

#### Observability

The following metrics can be used to observe this process:

- **`cortex_member_ring_tokens_owned`**<br />
How many tokens each ingester thinks it owns.
- **`cortex_ring_tokens_owned`**<br />
How many tokens each ingester is seen to own by other components.
- **`cortex_ring_member_ownership_percent`**<br />
Same as `cortex_ring_tokens_owned` but expressed as a percentage.
- **`cortex_ring_members`**<br />
How many ingesters can be seen in each state, by other components.
- **`cortex_ingester_sent_chunks`**<br />
Number of chunks sent by leaving ingester.
- **`cortex_ingester_received_chunks`**<br />
Number of chunks received by joining ingester.

You can see the current state of the ring via an HTTP request to
`/ring` on a distributor.
11 changes: 1 addition & 10 deletions integration/ingester_hand_over_test.go
@@ -16,13 +16,6 @@ import (
"github.com/cortexproject/cortex/integration/e2ecortex"
)

func TestIngesterHandOverWithBlocksStorage(t *testing.T) {
runIngesterHandOverTest(t, BlocksStorageFlags, func(t *testing.T, s *e2e.Scenario) {
minio := e2edb.NewMinio(9000, BlocksStorageFlags["-experimental.blocks-storage.s3.bucket-name"])
require.NoError(t, s.StartAndWaitReady(minio))
})
}

func TestIngesterHandOverWithChunksStorage(t *testing.T) {
runIngesterHandOverTest(t, ChunksStorageFlags, func(t *testing.T, s *e2e.Scenario) {
dynamo := e2edb.NewDynamoDB()
@@ -77,9 +70,7 @@ func runIngesterHandOverTest(t *testing.T, flags map[string]string, setup func(t
assert.Equal(t, expectedVector, result.(model.Vector))

// Ensure 1st ingester metrics are tracked correctly.
if flags["-store.engine"] != blocksStorageEngine {
require.NoError(t, ingester1.WaitSumMetrics(e2e.Equals(1), "cortex_ingester_chunks_created_total"))
}
require.NoError(t, ingester1.WaitSumMetrics(e2e.Equals(1), "cortex_ingester_chunks_created_total"))

// Start ingester-2.
ingester2 := e2ecortex.NewIngester("ingester-2", consul.NetworkHTTPEndpoint(), mergeFlags(flags, map[string]string{