Skip to content

Conversation

jtlisi
Copy link
Contributor

@jtlisi jtlisi commented May 28, 2020

What this PR does:

  • Add a name and backend type label to the prometheus metrics that wrap KV clients
  • Remove dedicated instrumentation for the consul client in favor of a generic instrumentation for all clients

Which issue(s) this PR fixes:
Fixes #2484

Checklist

  • Tests updated
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@jtlisi jtlisi changed the title initial attempt Generic KV Client Instrumentation May 28, 2020
@jtlisi jtlisi force-pushed the 20200528_generic_kv_metrics branch 2 times, most recently from de55827 to 170eb06 Compare May 28, 2020 20:20
Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with having generic metrics client only, although I've found ability to see individual Consul "CAS" calls useful in the past.

After some recent changes related to Cortex/Thanos metrics, I now think that passing name to individual components that is only used for ConstLabels is an anti-pattern:

  • It is giving component more responsibility than it needs
  • Name is useless is registrerer is nil
  • Name is not used by sub-components (eg. if Consul client itself wanted to register its own metrics, that would create conflict)

Better way of handling multiple components with same metrics is to wrap registrerer to add constant label to all metrics registered by component: prometheus.WrapRegistererWith(prometheus.Labels{"name": "component"}, reg) (if reg is not nil. extprom.WrapRegistererWith() from Thanos handles nil case as well), and pass such wrapped registrer to component. This solves all problems mentioned above.

/cc @pracucci to hear your opinion on this as well

Comment on lines 30 to 48
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make sure to do that before merging please.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. This TODO should be removed (and the default register never used).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think passing name to components that is used just for metrics registration is an anti-pattern. Caller of this call can easily wrap reg into prometheus.WrapRegistererWith(prometheus.Labels{"name": "some name here"}, reg) if it wishes to do so. I would prefer to use this pattern instead. KV store client doesn't need a name for anything else, and it also doesn't need it if registrerer is nil.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great opportunity to pass registrerer to lifecycler and move lifecycler metrics away from global variables.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But maybe in another PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will pass the Registerer but keep the lifecycler metrics global for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously we could see individual CAS calls to consul. Now we will only see "CAS loop".

@pracucci
Copy link
Contributor

pracucci commented May 29, 2020

Better way of handling multiple components with same metrics is to wrap registrerer to add constant label to all metrics registered by component

Agree. It's more flexible.

Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job @jtlisi! I left few comments as well.

Comment on lines 30 to 48
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. This TODO should be removed (and the default register never used).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generated name here looks a bit weird to me. Maybe querier-store-gateway may be more clear?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I wasn't sure how best to name some of these clients.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I notice, I think previous version was even wrong. We did track "Put" twice for 1 Put call. Once here, and once in consul/metrics.go. Am I missing anything?

Copy link
Contributor Author

@jtlisi jtlisi Jun 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After investigating further, the consul metrics probably should remain as is. They provide lower level metrics for the client interacting with consul itself: https://github.com/hashicorp/consul/blob/master/api/kv.go

In our KV implementation CAS is retried multiple times and by using a generic metrics client we are missing out on instrumenting some of these calls. One option that may be used to solve this problem is moving retries to its own KV implementation and wrap clients with it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am going to re-add the separate consul instrumentation and will add an issue to add a generic retry implementation for the KV store interface. WDYT?

Comment on lines 43 to 60
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please mention this change in the CHANGELOG. You should also check if it has an impact on the cortex mixin's dashboards (depends on how we group series, I haven't checked).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But maybe in another PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remember to mention in the CHANGELOG that the metric cortex_consul_request_duration_seconds has been removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I re-added this metric since it fits a different use case.

@jtlisi jtlisi force-pushed the 20200528_generic_kv_metrics branch from ff4541e to 40c71c7 Compare June 1, 2020 20:07
@jtlisi
Copy link
Contributor Author

jtlisi commented Jun 1, 2020

@pstibrany & @pracucci

I updated this PR per your comments. Changes include:

  • Keeping the cortex_consul_request_duration_seconds metric and consul instrumentation w/ note about a possible way of removing it in a future PR
  • Registerers are now plumbed through to each call of the kv.NewClient
  • Updated ChangeLog

I will test this PR w/ the cortex-mixin this week and ensure it operates correctly with the new metrics.

If you have anymore feedback about the naming of KV clients let me know. I wasn't sure if some of the names/suffixes I chose made sense.

Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have couple of nits, but lgtm overall. Thanks!

UpdateTimeout: 100 * time.Millisecond,
FailoverTimeout: time.Second,
})
}, prometheus.NewRegistry())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
}, prometheus.NewRegistry())
}, nil)

Since we don't verify metrics.

Comment on lines 45 to 48
// If no Registerer is provided return the raw client
if reg == nil {
return c
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I'd move this check into createClient function.

return nil
}

return prometheus.WrapRegistererWith(prometheus.Labels{"name": name}, reg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth using something more unique, e.g. kv-name here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. I was just basing it off the label we use for the ring.

Comment on lines 14 to 15
primaryLabel = prometheus.Labels{"role": "primary"}
secondaryLabel = prometheus.Labels{"role": "secondary"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only used from client.go, I would move it into that file.

CHANGELOG.md Outdated

## master / unreleased

* [CHANGE] Metric `cortex_kv_request_duration_seconds` now includes `name` label to denote which client is being used as well as the `backend` label to denote the KV backend implemnetation in use. #2648
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* [CHANGE] Metric `cortex_kv_request_duration_seconds` now includes `name` label to denote which client is being used as well as the `backend` label to denote the KV backend implemnetation in use. #2648
* [CHANGE] Metric `cortex_kv_request_duration_seconds` now includes `name` label to denote which client is being used as well as the `backend` label to denote the KV backend implementation in use. #2648

Comment on lines 16 to 19
var (
RingNameForClient = "distributor"
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
var (
RingNameForClient = "distributor"
)
const (
ringNameForClient = "distributor"
)

And move to distributor_ring.go where it is used.

(Alternatively, just inline it)

ringStore, err := kv.NewClient(
cfg.Ring.KVStore,
ring.GetCodec(),
kv.RegistererWithKVName(reg, ring.RulerRingKey),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
kv.RegistererWithKVName(reg, ring.RulerRingKey),
kv.RegistererWithKVName(reg, "ruler"),

ring.RulerRingKey is actually ring.

ringStore, err = kv.NewClient(
gatewayCfg.ShardingRing.KVStore,
ring.GetCodec(),
kv.RegistererWithKVName(reg, RingNameForClient),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
kv.RegistererWithKVName(reg, RingNameForClient),
kv.RegistererWithKVName(reg, "store-gateway"),

Signed-off-by: Jacob Lisi <[email protected]>
Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for patiently address my comments!

@pracucci pracucci merged commit 1188a91 into cortexproject:master Jun 4, 2020
@pracucci pracucci deleted the 20200528_generic_kv_metrics branch June 4, 2020 07:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Need a way to differentiate between KV metrics for different rings and dudupers in single binary mode.
3 participants