20190115 ruler better flags #1987


Merged — 6 commits, Jan 21, 2020

Changes from all commits
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,9 @@ If you are running with a high `-ruler.num-workers` and if you're not able to ex
Further, if you're using the configs service, we've upgraded the migration library and this requires some manual intervention. See full
instructions below to upgrade your Postgres.

* [CHANGE] Remove unnecessary configs/flags from the ruler ring config to align with the pattern used in the distributor ring. #1987
* Ruler ring related flags are now all prefixed with `ruler.ring.` as opposed to just `ruler.`
* Changed the default value for `-ruler.ring.prefix` from `collectors/` to `rulers/` so that it does not clash with other keys (e.g. other rings) stored in the same key-value store.
* [CHANGE] The frontend component no longer caches results if it finds a `Cache-Control` header whose values include `no-store`. #1974
* [CHANGE] Flags changed with transition to upstream Prometheus rules manager:
* `ruler.client-timeout` is now `ruler.configs.client-timeout` in order to match `ruler.configs.url`
110 changes: 39 additions & 71 deletions docs/configuration/config-file-reference.md
@@ -679,91 +679,59 @@ alertmanagerurl:
# CLI flag: -ruler.search-pending-for
[searchpendingfor: <duration> | default = 5m0s]

-lifecyclerconfig:
-  ring:
-    kvstore:
-      # Backend storage to use for the ring. Supported values are: consul, etcd,
-      # inmemory, multi, memberlist (experimental).
-      # CLI flag: -ruler.store
-      [store: <string> | default = "consul"]
+ring:
+  kvstore:
+    # Backend storage to use for the ring. Supported values are: consul, etcd,
+    # inmemory, multi, memberlist (experimental).
+    # CLI flag: -ruler.ring.store
+    [store: <string> | default = "consul"]

-      # The prefix for the keys in the store. Should end with a /.
-      # CLI flag: -ruler.prefix
-      [prefix: <string> | default = "collectors/"]
+    # The prefix for the keys in the store. Should end with a /.
+    # CLI flag: -ruler.ring.prefix
+    [prefix: <string> | default = "rulers/"]

-      # The consul_config configures the consul client.
-      # The CLI flags prefix for this block config is: ruler
-      [consul: <consul_config>]
+    # The consul_config configures the consul client.
+    # The CLI flags prefix for this block config is: ruler.ring
+    [consul: <consul_config>]

-      # The etcd_config configures the etcd client.
-      # The CLI flags prefix for this block config is: ruler
-      [etcd: <etcd_config>]
+    # The etcd_config configures the etcd client.
+    # The CLI flags prefix for this block config is: ruler.ring
+    [etcd: <etcd_config>]

-      # The memberlist_config configures the Gossip memberlist.
-      # The CLI flags prefix for this block config is: ruler
-      [memberlist: <memberlist_config>]
+    # The memberlist_config configures the Gossip memberlist.
+    # The CLI flags prefix for this block config is: ruler.ring
+    [memberlist: <memberlist_config>]

-      multi:
-        # Primary backend storage used by multi-client.
-        # CLI flag: -ruler.multi.primary
-        [primary: <string> | default = ""]
+    multi:
+      # Primary backend storage used by multi-client.
+      # CLI flag: -ruler.ring.multi.primary
+      [primary: <string> | default = ""]

-        # Secondary backend storage used by multi-client.
-        # CLI flag: -ruler.multi.secondary
-        [secondary: <string> | default = ""]
+      # Secondary backend storage used by multi-client.
+      # CLI flag: -ruler.ring.multi.secondary
+      [secondary: <string> | default = ""]

-        # Mirror writes to secondary store.
-        # CLI flag: -ruler.multi.mirror-enabled
-        [mirror_enabled: <boolean> | default = false]
+      # Mirror writes to secondary store.
+      # CLI flag: -ruler.ring.multi.mirror-enabled
+      [mirror_enabled: <boolean> | default = false]

-        # Timeout for storing value to secondary store.
-        # CLI flag: -ruler.multi.mirror-timeout
-        [mirror_timeout: <duration> | default = 2s]
+      # Timeout for storing value to secondary store.
+      # CLI flag: -ruler.ring.multi.mirror-timeout
+      [mirror_timeout: <duration> | default = 2s]

-    # The heartbeat timeout after which ingesters are skipped for reads/writes.
-    # CLI flag: -ruler.ring.heartbeat-timeout
-    [heartbeat_timeout: <duration> | default = 1m0s]
+  # Period at which to heartbeat to the ring.
+  # CLI flag: -ruler.ring.heartbeat-period
+  [heartbeat_period: <duration> | default = 5s]

-    # The number of ingesters to write to and read from.
-    # CLI flag: -ruler.distributor.replication-factor
-    [replication_factor: <int> | default = 3]
+  # The heartbeat timeout after which rulers are considered unhealthy within the
+  # ring.
+  # CLI flag: -ruler.ring.heartbeat-timeout
+  [heartbeat_timeout: <duration> | default = 1m0s]

-  # Number of tokens for each ingester.
-  # CLI flag: -ruler.num-tokens
-  [num_tokens: <int> | default = 128]
+  # Number of tokens for each ruler.
+  # CLI flag: -ruler.ring.num-tokens
+  [num_tokens: <int> | default = 128]

-  # Period at which to heartbeat to consul.
-  # CLI flag: -ruler.heartbeat-period
-  [heartbeat_period: <duration> | default = 5s]
-
-  # Observe tokens after generating to resolve collisions. Useful when using
-  # gossiping ring.
-  # CLI flag: -ruler.observe-period
-  [observe_period: <duration> | default = 0s]
-
-  # Period to wait for a claim from another member; will join automatically
-  # after this.
-  # CLI flag: -ruler.join-after
-  [join_after: <duration> | default = 0s]
-
-  # Minimum duration to wait before becoming ready. This is to work around race
-  # conditions with ingesters exiting and updating the ring.
-  # CLI flag: -ruler.min-ready-duration
-  [min_ready_duration: <duration> | default = 1m0s]
-
-  # Name of network interface to read address from.
-  # CLI flag: -ruler.lifecycler.interface
-  [interface_names: <list of string> | default = [eth0 en0]]
-
-  # Duration to sleep for before exiting, to ensure metrics are scraped.
-  # CLI flag: -ruler.final-sleep
-  [final_sleep: <duration> | default = 30s]
-
-  # File path where tokens are stored. If empty, tokens are not stored at
-  # shutdown and restored at startup.
-  # CLI flag: -ruler.tokens-file-path
-  [tokens_file_path: <string> | default = ""]

 # Period with which to attempt to flush rule groups.
 # CLI flag: -ruler.flush-period
 [flushcheckperiod: <duration> | default = 1m0s]
26 changes: 26 additions & 0 deletions docs/guides/sharded_ruler.md
@@ -0,0 +1,26 @@
---
title: "Config for horizontally scaling the Ruler"
linkTitle: "Config for horizontally scaling the Ruler"
weight: 4
slug: ruler-sharding
---

## Context

One option for scaling the ruler is to scale it horizontally. However, with multiple ruler instances running, they need to coordinate to determine which instance evaluates which rule. Similar to the ingesters, the rulers establish a hash ring to divide up the responsibility of evaluating rules.
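The division of work described above can be sketched with a toy hash ring. The names, token scheme, and plain FNV hash below are illustrative simplifications, not Cortex's actual implementation (which generates tokens in its ring package and propagates them through a key-value store):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// token hashes a string onto the ring's 32-bit keyspace.
func token(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// hashRing maps sorted token values to the ruler instance owning them.
type hashRing struct {
	tokens []uint32
	owners map[uint32]string
}

// newRing registers tokensPerRuler tokens per instance, mirroring the
// idea behind -ruler.ring.num-tokens.
func newRing(rulers []string, tokensPerRuler int) *hashRing {
	r := &hashRing{owners: map[uint32]string{}}
	for _, name := range rulers {
		for i := 0; i < tokensPerRuler; i++ {
			t := token(fmt.Sprintf("%s-%d", name, i))
			r.tokens = append(r.tokens, t)
			r.owners[t] = name
		}
	}
	sort.Slice(r.tokens, func(i, j int) bool { return r.tokens[i] < r.tokens[j] })
	return r
}

// owner returns the instance responsible for a rule group: the owner of
// the first token at or after the group's hash, wrapping around to 0.
func (r *hashRing) owner(ruleGroup string) string {
	h := token(ruleGroup)
	i := sort.Search(len(r.tokens), func(i int) bool { return r.tokens[i] >= h })
	if i == len(r.tokens) {
		i = 0
	}
	return r.owners[r.tokens[i]]
}

func main() {
	r := newRing([]string{"ruler-0", "ruler-1", "ruler-2"}, 128)
	for _, g := range []string{"alerts;cpu", "alerts;mem", "recording;http"} {
		fmt.Printf("%s evaluated by %s\n", g, r.owner(g))
	}
}
```

Because every ruler builds the same view of the ring from the shared key-value store, each instance can compute `owner` locally and evaluate only the rule groups that hash to itself.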

## Config
Reviewer comment (Contributor): After rebasing and running `make doc`, you will also get the config file documentation updated. It could be worth adding a link to that doc (each root config block has an anchor, so you could link to `#ruler-config`), mentioning it like "To see the complete set of config options, please check out...".
In order to enable sharding in the ruler, the following flag needs to be set:

```
-ruler.enable-sharding=true
```

In addition, the ruler requires its own ring to be configured, for instance:

```
-ruler.ring.consul.hostname=consul.dev.svc.cluster.local:8500
```

The only required configuration is to enable sharding and to configure a key-value store. From there, the rulers will handle the division of rules across instances automatically.
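Putting the two requirements together, a minimal sharded-ruler flag set might look like the following (the Consul hostname is illustrative, and `consul` is shown explicitly even though it is the default value for `-ruler.ring.store`):

```
-ruler.enable-sharding=true
-ruler.ring.store=consul
-ruler.ring.consul.hostname=consul.dev.svc.cluster.local:8500
```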
2 changes: 1 addition & 1 deletion pkg/cortex/modules.go
@@ -405,7 +405,7 @@ func (t *Cortex) stopTableManager() error {
}

func (t *Cortex) initRuler(cfg *Config) (err error) {
-cfg.Ruler.LifecyclerConfig.ListenPort = &cfg.Server.GRPCListenPort
+cfg.Ruler.Ring.ListenPort = cfg.Server.GRPCListenPort
queryable, engine := querier.New(cfg.Querier, t.distributor, t.store)

t.ruler, err = ruler.NewRuler(cfg.Ruler, engine, queryable, t.distributor)
8 changes: 4 additions & 4 deletions pkg/ruler/lifecycle_test.go
@@ -21,14 +21,14 @@ func TestRulerShutdown(t *testing.T) {
}

test.Poll(t, 100*time.Millisecond, 0, func() interface{} {
-return testutils.NumTokens(config.LifecyclerConfig.RingConfig.KVStore.Mock, "localhost", ring.RulerRingKey)
+return testutils.NumTokens(config.Ring.KVStore.Mock, "localhost", ring.RulerRingKey)
})
}

// TestRulerRestart tests a restarting ruler doesn't keep adding more tokens.
func TestRulerRestart(t *testing.T) {
config := defaultRulerConfig()
-config.LifecyclerConfig.SkipUnregister = true
+config.Ring.SkipUnregister = true
config.EnableSharding = true

{
@@ -38,7 +38,7 @@ func TestRulerRestart(t *testing.T) {
}

test.Poll(t, 100*time.Millisecond, 1, func() interface{} {
-return testutils.NumTokens(config.LifecyclerConfig.RingConfig.KVStore.Mock, "localhost", ring.RulerRingKey)
+return testutils.NumTokens(config.Ring.KVStore.Mock, "localhost", ring.RulerRingKey)
})

{
@@ -50,6 +50,6 @@
time.Sleep(200 * time.Millisecond)

test.Poll(t, 100*time.Millisecond, 1, func() interface{} {
-return testutils.NumTokens(config.LifecyclerConfig.RingConfig.KVStore.Mock, "localhost", ring.RulerRingKey)
+return testutils.NumTokens(config.Ring.KVStore.Mock, "localhost", ring.RulerRingKey)
})
}
9 changes: 5 additions & 4 deletions pkg/ruler/ruler.go
@@ -61,14 +61,14 @@ type Config struct {

EnableSharding bool // Enable sharding rule groups
SearchPendingFor time.Duration
-LifecyclerConfig ring.LifecyclerConfig
+Ring             RingConfig
FlushCheckPeriod time.Duration
}

// RegisterFlags adds the flags required to config this to the given FlagSet
func (cfg *Config) RegisterFlags(f *flag.FlagSet) {
-cfg.LifecyclerConfig.RegisterFlagsWithPrefix("ruler.", f)
 cfg.StoreConfig.RegisterFlags(f)
+cfg.Ring.RegisterFlags(f)

// Deprecated Flags that will be maintained to avoid user disruption
flagext.DeprecatedFlag(f, "ruler.client-timeout", "This flag has been renamed to ruler.configs.client-timeout")
@@ -149,14 +149,15 @@ func NewRuler(cfg Config, engine *promql.Engine, queryable promStorage.Queryable
// If sharding is enabled, create/join a ring to distribute tokens to
// the ruler
if cfg.EnableSharding {
-ruler.lifecycler, err = ring.NewLifecycler(cfg.LifecyclerConfig, ruler, "ruler", ring.RulerRingKey, true)
+lifecyclerCfg := cfg.Ring.ToLifecyclerConfig()
+ruler.lifecycler, err = ring.NewLifecycler(lifecyclerCfg, ruler, "ruler", ring.RulerRingKey, true)
if err != nil {
return nil, err
}

ruler.lifecycler.Start()

-ruler.ring, err = ring.New(cfg.LifecyclerConfig.RingConfig, "ruler", ring.RulerRingKey)
+ruler.ring, err = ring.New(lifecyclerCfg.RingConfig, "ruler", ring.RulerRingKey)
if err != nil {
return nil, err
}
92 changes: 92 additions & 0 deletions pkg/ruler/ruler_ring.go
@@ -0,0 +1,92 @@
package ruler

import (
"flag"
"os"
"time"

"github.com/cortexproject/cortex/pkg/ring"
"github.com/cortexproject/cortex/pkg/ring/kv"
"github.com/cortexproject/cortex/pkg/util"
"github.com/cortexproject/cortex/pkg/util/flagext"
"github.com/go-kit/kit/log/level"
)

// RingConfig masks the ring lifecycler config, which contains
// many options not really required by the ruler's ring. This config
// strips the options down to the minimum, to avoid confusing the user.
type RingConfig struct {
KVStore kv.Config `yaml:"kvstore,omitempty"`
HeartbeatPeriod time.Duration `yaml:"heartbeat_period,omitempty"`
HeartbeatTimeout time.Duration `yaml:"heartbeat_timeout,omitempty"`

// Instance details
InstanceID string `yaml:"instance_id" doc:"hidden"`
InstanceInterfaceNames []string `yaml:"instance_interface_names" doc:"hidden"`
InstancePort int `yaml:"instance_port" doc:"hidden"`
InstanceAddr string `yaml:"instance_addr" doc:"hidden"`
NumTokens int `yaml:"num_tokens"`

// Injected internally
ListenPort int `yaml:"-"`

// Used for testing
SkipUnregister bool `yaml:"-"`
}

// RegisterFlags adds the flags required to config this to the given FlagSet
func (cfg *RingConfig) RegisterFlags(f *flag.FlagSet) {
hostname, err := os.Hostname()
if err != nil {
level.Error(util.Logger).Log("msg", "failed to get hostname", "err", err)
os.Exit(1)
}

// Ring flags
cfg.KVStore.RegisterFlagsWithPrefix("ruler.ring.", "rulers/", f)
f.DurationVar(&cfg.HeartbeatPeriod, "ruler.ring.heartbeat-period", 5*time.Second, "Period at which to heartbeat to the ring.")
f.DurationVar(&cfg.HeartbeatTimeout, "ruler.ring.heartbeat-timeout", time.Minute, "The heartbeat timeout after which rulers are considered unhealthy within the ring.")

// Instance flags
cfg.InstanceInterfaceNames = []string{"eth0", "en0"}
f.Var((*flagext.Strings)(&cfg.InstanceInterfaceNames), "ruler.ring.instance-interface", "Name of network interface to read address from.")
f.StringVar(&cfg.InstanceAddr, "ruler.ring.instance-addr", "", "IP address to advertise in the ring.")
f.IntVar(&cfg.InstancePort, "ruler.ring.instance-port", 0, "Port to advertise in the ring (defaults to server.grpc-listen-port).")
f.StringVar(&cfg.InstanceID, "ruler.ring.instance-id", hostname, "Instance ID to register in the ring.")
f.IntVar(&cfg.NumTokens, "ruler.ring.num-tokens", 128, "Number of tokens for each ruler.")
}

// ToLifecyclerConfig returns a LifecyclerConfig based on the ruler
// ring config.
func (cfg *RingConfig) ToLifecyclerConfig() ring.LifecyclerConfig {
// We have to make sure that the ring.LifecyclerConfig and ring.Config
// defaults are preserved
lc := ring.LifecyclerConfig{}
rc := ring.Config{}

flagext.DefaultValues(&lc)
flagext.DefaultValues(&rc)

// Configure ring
rc.KVStore = cfg.KVStore
rc.HeartbeatTimeout = cfg.HeartbeatTimeout
rc.ReplicationFactor = 1

// Configure lifecycler
lc.RingConfig = rc
lc.ListenPort = &cfg.ListenPort
lc.Addr = cfg.InstanceAddr
lc.Port = cfg.InstancePort
lc.ID = cfg.InstanceID
lc.InfNames = cfg.InstanceInterfaceNames
lc.SkipUnregister = cfg.SkipUnregister
lc.HeartbeatPeriod = cfg.HeartbeatPeriod
lc.NumTokens = cfg.NumTokens
lc.ObservePeriod = 0
lc.JoinAfter = 0
lc.MinReadyDuration = 0
lc.FinalSleep = 0

return lc
}
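The pattern `ToLifecyclerConfig` implements — expose a handful of fields, then rebuild the full library config starting from its defaults — can be sketched in isolation. The types below are hypothetical stand-ins for illustration, not Cortex APIs:

```go
package main

import "fmt"

// FullConfig stands in for a library config with many knobs
// (a hypothetical analogue of ring.LifecyclerConfig).
type FullConfig struct {
	HeartbeatTimeout string
	NumTokens        int
	JoinAfter        int
	ObservePeriod    int
}

// DefaultFullConfig plays the role of flagext.DefaultValues: it yields
// the library's defaults so unexposed fields keep sane values.
func DefaultFullConfig() FullConfig {
	return FullConfig{HeartbeatTimeout: "1m", NumTokens: 128, JoinAfter: 30, ObservePeriod: 10}
}

// MaskedConfig exposes only the fields users actually need, the same
// idea as the ruler's RingConfig masking the lifecycler config.
type MaskedConfig struct {
	HeartbeatTimeout string
	NumTokens        int
}

// ToFull starts from defaults and overlays only the exposed fields.
func (m MaskedConfig) ToFull() FullConfig {
	full := DefaultFullConfig()
	full.HeartbeatTimeout = m.HeartbeatTimeout
	full.NumTokens = m.NumTokens
	// Options irrelevant to this component are pinned explicitly,
	// mirroring ToLifecyclerConfig setting JoinAfter, ObservePeriod,
	// MinReadyDuration and FinalSleep to zero.
	full.JoinAfter = 0
	full.ObservePeriod = 0
	return full
}

func main() {
	full := MaskedConfig{HeartbeatTimeout: "30s", NumTokens: 64}.ToFull()
	fmt.Printf("%+v\n", full)
}
```

Because `ToFull` starts from `DefaultFullConfig()`, upstream defaults for options the mask does not expose are preserved automatically, which is why `ToLifecyclerConfig` above calls `flagext.DefaultValues` before copying fields over.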
14 changes: 6 additions & 8 deletions pkg/ruler/ruler_test.go
@@ -30,14 +30,12 @@ func defaultRulerConfig() Config {
},
}
flagext.DefaultValues(&cfg)
-flagext.DefaultValues(&cfg.LifecyclerConfig)
-cfg.LifecyclerConfig.RingConfig.ReplicationFactor = 1
-cfg.LifecyclerConfig.RingConfig.KVStore.Mock = consul
-cfg.LifecyclerConfig.NumTokens = 1
-cfg.LifecyclerConfig.FinalSleep = time.Duration(0)
-cfg.LifecyclerConfig.ListenPort = func(i int) *int { return &i }(0)
-cfg.LifecyclerConfig.Addr = "localhost"
-cfg.LifecyclerConfig.ID = "localhost"
+flagext.DefaultValues(&cfg.Ring)
+cfg.Ring.KVStore.Mock = consul
+cfg.Ring.NumTokens = 1
+cfg.Ring.ListenPort = 0
+cfg.Ring.InstanceAddr = "localhost"
+cfg.Ring.InstanceID = "localhost"
return cfg
}
