wip: support native histograms along classic histograms in panels #1121

Closed
wants to merge 16 commits into from
73 changes: 73 additions & 0 deletions grafana-builder/grafana.libsonnet
@@ -1,3 +1,5 @@
local utils = import 'mixin-utils/utils.libsonnet';

{
dashboard(title, uid='', datasource='default', datasource_regex=''):: {
// Stuff that isn't materialised.
@@ -448,6 +450,38 @@
],
} + $.stack,

// Assumes that the metricName is for a histogram (as opposed to qpsPanel above)
qpsPanelNativeHistogram(metricName, selector, statusLabelName='status_code'):: {
aliasColors: {
'1xx': '#EAB839',
'2xx': '#7EB26D',
'3xx': '#6ED0E0',
'4xx': '#EF843C',
'5xx': '#E24D42',
OK: '#7EB26D',
success: '#7EB26D',
'error': '#E24D42',
cancel: '#A9A9A9',
},
targets: [
{
expr:
|||
sum by (status) (
label_replace(label_replace(%(metricQuery)s,
"status", "${1}xx", "%(label)s", "([0-9]).."),
"status", "${1}", "%(label)s", "([a-zA-Z]+)"))
||| % {
metricQuery: utils.nativeClassicHistogramCountRate(metricName, selector),
label: statusLabelName,
},
format: 'time_series',
legendFormat: '{{status}}',
refId: 'A',
},
],
} + $.stack,

latencyPanel(metricName, selector, multiplier='1e3'):: {
nullPointMode: 'null as zero',
targets: [
@@ -473,6 +507,45 @@
yaxes: $.yaxes('ms'),
},

latencyPanelNativeHistogram(metricName, selector, multiplier='1e3'):: {
nullPointMode: 'null as zero',
targets: [
{
expr: '(%(metricQuery)s) * %(multiplier)s' % {
metricQuery: utils.nativeClassicHistogramQuantile('0.99', metricName, selector),
multiplier: multiplier,
},
format: 'time_series',
legendFormat: '99th percentile',
refId: 'A',
},
{
expr: '(%(metricQuery)s) * %(multiplier)s' % {
metricQuery: utils.nativeClassicHistogramQuantile('0.50', metricName, selector),
multiplier: multiplier,
},
format: 'time_series',
legendFormat: '50th percentile',
refId: 'B',
},
{
expr:
|||
%(multiplier)s * sum(%(sumMetricQuery)s) /
sum(%(countMetricQuery)s)
||| % {
sumMetricQuery: utils.nativeClassicHistogramSumRate(metricName, selector),
countMetricQuery: utils.nativeClassicHistogramCountRate(metricName, selector),
multiplier: multiplier,
},
format: 'time_series',
legendFormat: 'Average',
refId: 'C',
},
],
yaxes: $.yaxes('ms'),
},

selector:: {
eq(label, value):: { label: label, op: '=', value: value },
neq(label, value):: { label: label, op: '!=', value: value },
101 changes: 101 additions & 0 deletions mixin-utils/utils.libsonnet
@@ -1,6 +1,38 @@
local g = import 'grafana-builder/grafana.libsonnet';

{
// The nativeClassicHistogramQuantile function calculates histogram quantiles from native histograms or classic histograms.
// Metric name should be provided without _bucket suffix.
nativeClassicHistogramQuantile(percentile, metric, selector, sum_by=[], rate_interval='$__rate_interval')::
local classicSumBy = if std.length(sum_by) > 0 then ' by (%(lbls)s) ' % { lbls: std.join(',', ['le'] + sum_by) } else ' by (le) ';
local nativeSumBy = if std.length(sum_by) > 0 then ' by (%(lbls)s) ' % { lbls: std.join(',', sum_by) } else ' ';
'histogram_quantile(%(percentile)s, sum%(nativeSumBy)s(rate(%(metric)s{%(selector)s}[%(rateInterval)s]))) or histogram_quantile(%(percentile)s, sum%(classicSumBy)s(rate(%(metric)s_bucket{%(selector)s}[%(rateInterval)s])))' % {
Contributor Author

For reviewers, an example of the generated query:

(histogram_quantile(0.99, sum (rate(cortex_request_duration_seconds{}[$__rate_interval]))) or
histogram_quantile(0.99, sum by (le) (rate(cortex_request_duration_seconds_bucket{}[$__rate_interval]))))
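
The example above uses no grouping labels. As a minimal sketch of how the two legs differ once `sum_by` is set, the snippet below inlines a copy of `nativeClassicHistogramQuantile` from this diff so that it evaluates on its own. The selector `job="cortex"` and the grouping label `route` are assumptions made up for illustration.

```jsonnet
// Copy of nativeClassicHistogramQuantile from this diff, inlined so the
// snippet evaluates without importing mixin-utils.
local nativeClassicHistogramQuantile(percentile, metric, selector, sum_by=[], rate_interval='$__rate_interval') =
  // Classic histograms must keep the 'le' label when aggregating buckets.
  local classicSumBy = if std.length(sum_by) > 0 then ' by (%(lbls)s) ' % { lbls: std.join(',', ['le'] + sum_by) } else ' by (le) ';
  // Native histograms have no 'le' label, so only the caller's labels are used.
  local nativeSumBy = if std.length(sum_by) > 0 then ' by (%(lbls)s) ' % { lbls: std.join(',', sum_by) } else ' ';
  'histogram_quantile(%(percentile)s, sum%(nativeSumBy)s(rate(%(metric)s{%(selector)s}[%(rateInterval)s]))) or histogram_quantile(%(percentile)s, sum%(classicSumBy)s(rate(%(metric)s_bucket{%(selector)s}[%(rateInterval)s])))' % {
    classicSumBy: classicSumBy,
    metric: metric,
    nativeSumBy: nativeSumBy,
    percentile: percentile,
    rateInterval: rate_interval,
    selector: selector,
  };

{
  // Hypothetical selector and grouping label, chosen for illustration only.
  expr: nativeClassicHistogramQuantile(
    '0.99',
    'cortex_request_duration_seconds',
    'job="cortex"',
    sum_by=['route'],
  ),
}
```

With `sum_by=['route']`, the native leg aggregates `by (route)` while the classic leg aggregates `by (le,route)`, because only classic bucket series carry the `le` label.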

Contributor

I think this can be improved.

Imagine you have a fairly long rate_interval, e.g. multiple days (the problem exists for all intervals, but the longer the interval, the more serious it gets). While migrating to native histograms, you ingest classic and native histograms in parallel. Very soon, the first leg of the query above will yield a result, but it will be based on just a few native histogram samples, i.e. the last few minutes of the multi-day range.

So what you want is to use the leg of the query that has complete coverage of the rate_interval (and, when in doubt, prefer the native histogram). I'll try to come up with a way to do that in the next comment.

Contributor

Maybe this could work:

(
      histogram_quantile(
        0.99,
        sum(rate(cortex_request_duration_seconds[1w])) and group(cortex_request_duration_seconds offset 1w)
      )
    or
      histogram_quantile(
        0.99,
          sum by (le) (rate(cortex_request_duration_seconds_bucket[1w]))
        and
          group by (le) (cortex_request_duration_seconds_bucket)
      )
  or
    histogram_quantile(0.99, sum(rate(cortex_request_duration_seconds[1w])))
)

This assumes the migration is going from classic to native histograms. First, the query checks whether there was a native histogram 1w ago. If so, it uses native histograms. Second, it checks whether there is a classic histogram now, and if so uses it. As the third and final option, it uses native histograms (which is relevant for a new metric that never had classic histograms).

Of course, the stupid PromQL engine will calculate all three legs, even if only one will ever be used.
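
As a rough sketch of how the three-leg query above could be templated in Jsonnet: the helper name `coverageAwareHistogramQuantile` and its parameters are hypothetical and not part of this PR. The default range is `1w`, as in the comment; whether a Grafana variable such as `$__rate_interval` behaves as intended inside `offset` is not covered here and would need checking.

```jsonnet
// Hypothetical helper (not part of this PR): renders the three-leg query
// proposed in the comment above for an arbitrary metric, selector and range.
local coverageAwareHistogramQuantile(percentile, metric, selector, rate_interval='1w') =
  |||
    (
          histogram_quantile(
            %(percentile)s,
              sum(rate(%(metric)s{%(selector)s}[%(rateInterval)s]))
            and
              group(%(metric)s{%(selector)s} offset %(rateInterval)s)
          )
        or
          histogram_quantile(
            %(percentile)s,
              sum by (le) (rate(%(metric)s_bucket{%(selector)s}[%(rateInterval)s]))
            and
              group by (le) (%(metric)s_bucket{%(selector)s})
          )
      or
        histogram_quantile(%(percentile)s, sum(rate(%(metric)s{%(selector)s}[%(rateInterval)s])))
    )
  ||| % {
    metric: metric,
    percentile: percentile,
    rateInterval: rate_interval,
    selector: selector,
  };

{
  // An empty selector reproduces the query from the comment above.
  expr: coverageAwareHistogramQuantile('0.99', 'cortex_request_duration_seconds', ''),
}
```

The first leg is kept only when a native histogram sample existed one full range ago, the second only when classic buckets exist now, and the third is the unconditional native fallback.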

classicSumBy: classicSumBy,
metric: metric,
nativeSumBy: nativeSumBy,
percentile: percentile,
rateInterval: rate_interval,
selector: selector,
},

// The nativeClassicHistogramSumRate function calculates the rate of the histogram sum from native histograms or classic histograms.
// Metric name should be provided without _sum suffix.
nativeClassicHistogramSumRate(metric, selector, rate_interval='$__rate_interval')::
'histogram_sum(rate(%(metric)s{%(selector)s}[%(rateInterval)s])) or rate(%(metric)s_sum{%(selector)s}[%(rateInterval)s])' % {
Contributor Author

For reviewers, an example in the context of a sum:

sum(
 histogram_sum(rate(cortex_request_duration_seconds{}[$__rate_interval])) or
 rate(cortex_request_duration_seconds_sum{}[$__rate_interval])
)
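
The snippet below sketches how the sum and count helpers compose into an average, in the same way as the `Average` target of `latencyPanelNativeHistogram`. Both helpers are inlined copies from this diff so that the snippet evaluates on its own; the empty selector is an assumption for illustration.

```jsonnet
// Copies of the two helpers from this diff, inlined so the snippet
// evaluates without importing mixin-utils.
local nativeClassicHistogramSumRate(metric, selector, rate_interval='$__rate_interval') =
  'histogram_sum(rate(%(metric)s{%(selector)s}[%(rateInterval)s])) or rate(%(metric)s_sum{%(selector)s}[%(rateInterval)s])' % {
    metric: metric,
    rateInterval: rate_interval,
    selector: selector,
  };

local nativeClassicHistogramCountRate(metric, selector, rate_interval='$__rate_interval') =
  'histogram_count(rate(%(metric)s{%(selector)s}[%(rateInterval)s])) or rate(%(metric)s_count{%(selector)s}[%(rateInterval)s])' % {
    metric: metric,
    rateInterval: rate_interval,
    selector: selector,
  };

local metric = 'cortex_request_duration_seconds';

{
  // Each helper is wrapped in sum(...) so that its 'or' fallback stays
  // inside the numerator or the denominator respectively.
  expr: 'sum(%(sumQuery)s) / sum(%(countQuery)s)' % {
    sumQuery: nativeClassicHistogramSumRate(metric, ''),
    countQuery: nativeClassicHistogramCountRate(metric, ''),
  },
}
```

The wrapping matters because `/` binds more tightly than `or` in PromQL. Without the enclosing `sum(...)` (or parentheses), the division would apply only to the two operands adjacent to it.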

metric: metric,
rateInterval: rate_interval,
selector: selector,
},

// The nativeClassicHistogramCountRate function calculates the rate of the histogram count from native histograms or classic histograms.
// Metric name should be provided without _count suffix.
nativeClassicHistogramCountRate(metric, selector, rate_interval='$__rate_interval')::
'histogram_count(rate(%(metric)s{%(selector)s}[%(rateInterval)s])) or rate(%(metric)s_count{%(selector)s}[%(rateInterval)s])' % {
metric: metric,
rateInterval: rate_interval,
selector: selector,
},

histogramRules(metric, labels, interval='1m')::
local vars = {
metric: metric,
@@ -96,6 +128,75 @@ local g = import 'grafana-builder/grafana.libsonnet';
],
},

// not in use yet
// latencyRecordingRulePanelNativeHistogram(metric, selectors, extra_selectors=[], multiplier='1e3', sum_by=[])::
// local labels = std.join('_', [matcher.label for matcher in selectors]);
// local selectorStr = $.toPrometheusSelector(selectors + extra_selectors);
// local sb = ['le'];
// local legend = std.join('', ['{{ %(lb)s }} ' % lb for lb in sum_by]);
// // sumBy is used in the average calculation and also for native histograms where 'le' is not used
// local sumBy = if std.length(sum_by) > 0 then ' by (%(lbls)s) ' % { lbls: std.join(',', sum_by) } else '';
// local sumByHisto = std.join(',', sb + sum_by);
// {
// nullPointMode: 'null as zero',
// yaxes: g.yaxes('ms'),
// targets: [
// {
// expr:
// |||
// (histogram_quantile(0.99, sum by (%(sumBy)s) (%(labels)s:%(metric)s:sum_rate%(selector)s)) or
// histogram_quantile(0.99, sum by (%(sumByHisto)s) (%(labels)s:%(metric)s_bucket:sum_rate%(selector)s))) * %(multiplier)s
// ||| % {
// labels: labels,
// metric: metric,
// selector: selectorStr,
// multiplier: multiplier,
// sumBy: sumBy,
// sumByHisto: sumByHisto,
// },
// format: 'time_series',
// legendFormat: '%(legend)s99th percentile' % legend,
// refId: 'A',
// step: 10,
// },
// {
// expr:
// |||
// (histogram_quantile(0.50, sum by (%(sumBy)s) (%(labels)s:%(metric)s:sum_rate%(selector)s)) or
// histogram_quantile(0.50, sum by (%(sumByHisto)s) (%(labels)s:%(metric)s_bucket:sum_rate%(selector)s))) * %(multiplier)s
// ||| % {
// labels: labels,
// metric: metric,
// selector: selectorStr,
// multiplier: multiplier,
// sumBy: sumBy,
// sumByHisto: sumByHisto,
// },
// format: 'time_series',
// legendFormat: '%(legend)s50th percentile' % legend,
// refId: 'B',
// step: 10,
// },
// {
// expr:
// |||
// %(multiplier)s * (histogram_sum(sum(%(labels)s:%(metric)s:sum_rate%(selector)s)%(sumBy)s) or sum(%(labels)s:%(metric)s_sum:sum_rate%(selector)s)%(sumBy)s) /
// (histogram_count(sum(%(labels)s:%(metric)s:sum_rate%(selector)s)%(sumBy)s) or sum(%(labels)s:%(metric)s_count:sum_rate%(selector)s)%(sumBy)s)
// ||| % {
// labels: labels,
// metric: metric,
// selector: selectorStr,
// multiplier: multiplier,
// sumBy: sumBy,
// },
// format: 'time_series',
// legendFormat: '%(legend)sAverage' % legend,
// refId: 'C',
// step: 10,
// },
// ],
// },

selector:: {
eq(label, value):: { label: label, op: '=', value: value },
neq(label, value):: { label: label, op: '!=', value: value },