
Commit 68a44de

Integrate CRUD statistics with metrics
If `metrics` [1] is found, you can use metrics collectors to store statistics. Version `>= 0.9.0` is required to support age buckets in summary and to get crucial bugfixes under high load [2]. The metrics are part of the global registry and can be exported together (e.g. to Prometheus) with default tools without any additional configuration. Disabling stats destroys the collectors.

Local collectors are used by default. To use the metrics driver, call `crud.enable_stats{ driver = 'metrics' }`.

Be wary that using metrics collectors may drop overall performance. Running them with existing perf tests has shown a 2-3 times drop in rps. Raising quantile tolerance may result in even more crucial performance drops.

If `metrics` is used, `latency` statistics are changed to the 0.99 quantile of request execution time (with aging).

Add CI matrix to run tests with `metrics` installed.

1. https://github.com/tarantool/metrics
2. tarantool/metrics#235

Closes #224
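To see why switching `latency` from an average to a 0.99 quantile changes what the number means, the sketch below computes both over a small, made-up sample. It uses an exact nearest-rank quantile over a sorted copy of the data, which is an assumption made purely for illustration: the `metrics` summary collector estimates quantiles over a stream with a tolerance and age buckets and does not work this way. The helper names `mean` and `nearest_rank_quantile` are hypothetical.

```lua
-- Made-up latencies in seconds: 95 fast requests and 5 slow ones.
local samples = {}
for _ = 1, 95 do table.insert(samples, 0.001) end
for _ = 1, 5 do table.insert(samples, 0.1) end

-- Arithmetic mean over all samples.
local function mean(values)
    local sum = 0
    for _, v in ipairs(values) do
        sum = sum + v
    end
    return sum / #values
end

-- Exact nearest-rank quantile (illustration only, not the metrics algorithm).
local function nearest_rank_quantile(values, q)
    -- Sort a copy so the caller's table is left untouched.
    local sorted = {}
    for i, v in ipairs(values) do
        sorted[i] = v
    end
    table.sort(sorted)
    local rank = math.max(1, math.ceil(q * #sorted))
    return sorted[rank]
end

local avg = mean(samples)
local p99 = nearest_rank_quantile(samples, 0.99)
print(string.format('mean = %.5f s, 0.99 quantile = %.5f s', avg, p99))
```

With this sample the mean stays close to 6 ms while the 0.99 quantile reports the 100 ms tail, so dashboards built on `latency` may look quite different after changing the driver.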
1 parent 8545818 commit 68a44de

7 files changed

+702
-42
lines changed


.github/workflows/test_on_push.yaml

Lines changed: 11 additions & 1 deletion
@@ -13,13 +13,19 @@ jobs:
       matrix:
         # We need 1.10.6 here to check that module works with
         # old Tarantool versions that don't have "tuple-keydef"/"tuple-merger" support.
-        tarantool-version: ["1.10.6", "1.10", "2.2", "2.3", "2.4", "2.5", "2.6", "2.7"]
+        tarantool-version: ["1.10.6", "1.10", "2.2", "2.3", "2.4", "2.5", "2.6", "2.7", "2.8"]
+        metrics-version: [""]
         remove-merger: [false]
         include:
           - tarantool-version: "2.7"
             remove-merger: true
+          - tarantool-version: "2.8"
+            metrics-version: "0.1.8"
+          - tarantool-version: "2.8"
+            metrics-version: "0.9.0"
           - tarantool-version: "2.8"
             coveralls: true
+            metrics-version: "0.12.0"
       fail-fast: false
     runs-on: [ubuntu-latest]
     steps:
@@ -47,6 +53,10 @@ jobs:
           tarantool --version
           ./deps.sh
 
+      - name: Install metrics
+        if: matrix.metrics-version != ''
+        run: tarantoolctl rocks install metrics ${{ matrix.metrics-version }}
+
       - name: Remove external merger if needed
         if: ${{ matrix.remove-merger }}
         run: rm .rocks/lib/tarantool/tuple/merger.so

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 
 ### Added
 * Statistics for CRUD operations on router (#224).
+* Integrate CRUD statistics with `metrics` (#224).
 
 ### Changed

README.md

Lines changed: 47 additions & 5 deletions
@@ -603,9 +603,23 @@ crud.disable_stats()
 -- Enable statistics collect and recreates all collectors.
 crud.enable_stats()
 ```
-While statistics collection should not affect performance
-in a noticeable way, you may disable it if you want to
-prioritize performance.
+
+If [`metrics`](https://github.com/tarantool/metrics) is found,
+you can use metrics collectors to store statistics
+instead of local collectors.
+It is required to use version `0.9.0` or greater,
+otherwise local collectors will be used.
+```
+-- Use metrics collectors.
+crud.enable_stats({ driver = 'metrics' })
+```
+By default, local collectors (`{ driver = 'local' }`)
+are used. Metrics collectors are much more sophisticated and
+would show execution time quantile with aging for calls.
+Be wary that computing quantiles may affect overall
+performance under high load. Using local
+collectors or disabling stats is an option if
+you want to prioritize performance.
 
 Enabling stats on non-router instances is meaningless.

@@ -631,9 +645,34 @@ crud.stats()['insert']
 Each section contains different collectors for success calls
 and error (both error throw and `nil, err`) returns. `count`
 is total requests count since instance start or stats restart.
-`latency` is average time of requests execution,
+`latency` is the 0.99 quantile of request execution time if the `metrics`
+driver is used, otherwise `latency` is the total average.
 `time` is total time of requests execution.
 
+In `metrics` registry statistics are stored as `tnt_crud_stats` metrics
+with `operation` and `status` label_pairs.
+```
+metrics:collect()
+---
+- - label_pairs:
+      status: ok
+      operation: insert
+    value: 221411
+    metric_name: tnt_crud_stats_count
+  - label_pairs:
+      status: ok
+      operation: insert
+    value: 10.49834896344692
+    metric_name: tnt_crud_stats_sum
+  - label_pairs:
+      status: ok
+      operation: insert
+      quantile: 0.99
+    value: 0.00023606420935973
+    metric_name: tnt_crud_stats
+...
+```
+
 Additionally, `select` section contains `details` collectors.
 ```lua
 crud.stats()['select']['details']
@@ -647,7 +686,10 @@ crud.stats()['select']['details']
 (including those not executed successfully). `tuples_fetched`
 is a count of tuples fetched from storages during execution,
 `tuples_lookup` is a count of tuples looked up on storages
-while collecting response for call.
+while collecting response for call. In `metrics` registry they
+are stored as `tnt_crud_map_reduces`, `tnt_crud_tuples_fetched`
+and `tnt_crud_tuples_lookup` metrics with
+`{ operation = 'select' }` label_pairs.
 
 ## Cartridge roles

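The README changes above name the metrics and label pairs under which CRUD statistics appear in the registry. As a rough sketch of how a consumer might pick individual values out of a list shaped like the `metrics:collect()` sample, the snippet below uses a hypothetical helper `find_value`; it runs on a hand-written table rather than a live registry, and the `tnt_crud_tuples_fetched` value in it is made up for illustration.

```lua
-- Observations shaped like the `metrics:collect()` sample in the README diff.
local observations = {
    {
        metric_name = 'tnt_crud_stats_count',
        label_pairs = { status = 'ok', operation = 'insert' },
        value = 221411,
    },
    {
        metric_name = 'tnt_crud_stats_sum',
        label_pairs = { status = 'ok', operation = 'insert' },
        value = 10.49834896344692,
    },
    {
        metric_name = 'tnt_crud_stats',
        label_pairs = { status = 'ok', operation = 'insert', quantile = 0.99 },
        value = 0.00023606420935973,
    },
    {
        -- Made-up value, only to show the select details label pairs.
        metric_name = 'tnt_crud_tuples_fetched',
        label_pairs = { operation = 'select' },
        value = 15,
    },
}

-- Hypothetical helper: return the value of the first observation whose
-- name matches and whose label pairs include every requested label.
local function find_value(obs_list, name, labels)
    for _, obs in ipairs(obs_list) do
        if obs.metric_name == name then
            local matches = true
            for key, expected in pairs(labels) do
                if obs.label_pairs[key] ~= expected then
                    matches = false
                    break
                end
            end
            if matches then
                return obs.value
            end
        end
    end
    return nil
end

local ok_insert = { operation = 'insert', status = 'ok' }
local count = find_value(observations, 'tnt_crud_stats_count', ok_insert)
local total_time = find_value(observations, 'tnt_crud_stats_sum', ok_insert)
local fetched = find_value(observations, 'tnt_crud_tuples_fetched', { operation = 'select' })

-- The average can still be derived from `_sum` and `_count`
-- even though `latency` now reports a quantile.
local average = total_time / count
print(count, fetched, average)
```

Because the summary collector exposes `_sum` and `_count` alongside the quantile, the average latency that local collectors report can be recomputed on the consumer side if it is still needed.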
crud/stats/metrics_registry.lua

Lines changed: 232 additions & 0 deletions
@@ -0,0 +1,232 @@
+local is_package, metrics = pcall(require, 'metrics')
+
+local label = require('crud.stats.label')
+local dev_checks = require('crud.common.dev_checks')
+local registry_common = require('crud.stats.registry_common')
+
+local registry = {}
+local _registry = {}
+
+local metric_name = {
+    -- Summary collector for all operations.
+    op = 'tnt_crud_stats',
+    -- `*_count` and `*_sum` are automatically created
+    -- by summary collector.
+    op_count = 'tnt_crud_stats_count',
+    op_sum = 'tnt_crud_stats_sum',
+
+    -- Counter collectors for select/pairs details.
+    tuples_fetched = 'tnt_crud_tuples_fetched',
+    tuples_lookup = 'tnt_crud_tuples_lookup',
+    map_reduces = 'tnt_crud_map_reduces',
+}
+
+local LATENCY_QUANTILE = 0.99
+
+-- Raising quantile tolerance (1e-2) may result in crucial
+-- performance drops.
+local DEFAULT_QUANTILES = {
+    [LATENCY_QUANTILE] = 1e-2,
+}
+
+local DEFAULT_SUMMARY_PARAMS = {
+    age_buckets_count = 2,
+    max_age_time = 60,
+}
+
+--- Check if application supports metrics rock for registry
+--
+-- `metrics >= 0.9.0` is required to use summary with
+-- age buckets. `metrics >= 0.5.0, < 0.9.0` is unsupported
+-- due to quantile overflow bug
+-- (https://github.com/tarantool/metrics/issues/235).
+--
+-- @function is_supported
+--
+-- @treturn boolean Returns true if `metrics >= 0.9.0` found, false otherwise.
+--
+function registry.is_supported()
+    if is_package == false then
+        return false
+    end
+
+    -- Only metrics >= 0.9.0 supported.
+    local is_summary, summary = pcall(require, 'metrics.collectors.summary')
+    if is_summary == false or summary.rotate_age_buckets == nil then
+        return false
+    end
+
+    return true
+end
+
+
+--- Initialize collectors in global metrics registry
+--
+-- @function init
+--
+-- @treturn boolean Returns true.
+--
+function registry.init()
+    _registry[metric_name.op] = metrics.summary(
+        metric_name.op,
+        'CRUD router calls statistics',
+        DEFAULT_QUANTILES,
+        DEFAULT_SUMMARY_PARAMS)
+
+    _registry[metric_name.tuples_fetched] = metrics.counter(
+        metric_name.tuples_fetched,
+        'Tuples fetched from CRUD storages during select/pairs')
+
+    _registry[metric_name.tuples_lookup] = metrics.counter(
+        metric_name.tuples_lookup,
+        'Tuples looked up on CRUD storages while collecting response during select/pairs')
+
+    _registry[metric_name.map_reduces] = metrics.counter(
+        metric_name.map_reduces,
+        'Map reduces planned during CRUD select/pairs')
+
+    return true
+end
+
+--- Unregister collectors in global metrics registry
+--
+-- @function destroy
+--
+-- @treturn boolean Returns true.
+--
+function registry.destroy()
+    for _, c in pairs(_registry) do
+        metrics.registry:unregister(c)
+    end
+
+    _registry = {}
+    return true
+end
+
+--- Get copy of global metrics registry
+--
+-- @function get
+--
+-- @treturn table Returns copy of metrics registry.
+function registry.get()
+    local stats = {}
+
+    -- Fill empty collectors with zero values.
+    for _, op_label in pairs(label) do
+        stats[op_label] = registry_common.build_collector(op_label)
+    end
+
+    for _, obs in ipairs(_registry[metric_name.op]:collect()) do
+        local operation = obs.label_pairs.operation
+        local status = obs.label_pairs.status
+        if obs.metric_name == metric_name.op then
+            if obs.label_pairs.quantile == LATENCY_QUANTILE then
+                stats[operation][status].latency = obs.value
+            end
+        elseif obs.metric_name == metric_name.op_sum then
+            stats[operation][status].time = obs.value
+        elseif obs.metric_name == metric_name.op_count then
+            stats[operation][status].count = obs.value
+        end
+    end
+
+    local _, obs_tuples_fetched = next(_registry[metric_name.tuples_fetched]:collect())
+    if obs_tuples_fetched ~= nil then
+        stats[label.SELECT].details.tuples_fetched = obs_tuples_fetched.value
+    end
+
+    local _, obs_tuples_lookup = next(_registry[metric_name.tuples_lookup]:collect())
+    if obs_tuples_lookup ~= nil then
+        stats[label.SELECT].details.tuples_lookup = obs_tuples_lookup.value
+    end
+
+    local _, obs_map_reduces = next(_registry[metric_name.map_reduces]:collect())
+    if obs_map_reduces ~= nil then
+        stats[label.SELECT].details.map_reduces = obs_map_reduces.value
+    end
+
+    return stats
+end
+
+--- Increase requests count and update latency info
+--
+-- @function observe
+--
+-- @tparam string op_label
+-- Label of registry collectors.
+-- Use `require('crud.common.const').OP` to pick one.
+--
+-- @tparam boolean success
+-- true if no errors on execution, false otherwise.
+--
+-- @tparam number latency
+-- Time of call execution.
+--
+-- @treturn boolean Returns true.
+--
+
+local total = 0
+
+function registry.observe(op_label, success, latency)
+    dev_checks('string', 'boolean', 'number')
+
+    local label_pairs = { operation = op_label }
+    if success == true then
+        label_pairs.status = 'ok'
+    else
+        label_pairs.status = 'error'
+    end
+
+    local clock = require('clock')
+    local start = clock.monotonic()
+    _registry[metric_name.op]:observe(latency, label_pairs)
+    local diff = clock.monotonic() - start
+    -- require('log').error("latency: %f", latency)
+    -- require('log').error("diff: %f", diff)
+    total = total + diff
+    -- require('log').error("total: %f", total)
+
+    return true
+end
+
+--- Increase statistics of storage select/pairs calls
+--
+-- @function observe_fetch
+--
+-- @tparam number tuples_fetched
+-- Count of tuples fetched during storage call.
+--
+-- @tparam number tuples_lookup
+-- Count of tuples looked up on storages while collecting response.
+--
+-- @treturn boolean Returns true.
+--
+function registry.observe_fetch(tuples_fetched, tuples_lookup)
+    dev_checks('number', 'number')
+
+    local label_pairs = { operation = label.SELECT }
+
+    _registry[metric_name.tuples_fetched]:inc(tuples_fetched, label_pairs)
+    _registry[metric_name.tuples_lookup]:inc(tuples_lookup, label_pairs)
+    return true
+end
+
+--- Increase statistics of planned map reduces during select/pairs
+--
+-- @function observe_map_reduces
+--
+-- @tparam number count
+-- Count of map reduces planned.
+--
+-- @treturn boolean Returns true.
+--
+function registry.observe_map_reduces(count)
+    dev_checks('number')
+
+    local label_pairs = { operation = label.SELECT }
+
+    _registry[metric_name.map_reduces]:inc(count, label_pairs)
+    return true
+end
+
+return registry
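
`registry.is_supported` above detects a usable `metrics` version by probing for a feature (`rotate_age_buckets` on the summary collector) instead of parsing a version string. The sketch below shows the same probing idea combined with the fallback to local collectors that the README describes. The function `pick_driver` is hypothetical and is not part of `crud`; how `crud.enable_stats` actually resolves the driver is not shown in this diff.

```lua
-- Hypothetical helper: decide which stats driver to use.
-- Falls back to 'local' unless a recent enough `metrics` rock is present.
local function pick_driver(requested)
    if requested ~= 'metrics' then
        return 'local'
    end

    -- pcall keeps a missing rock from raising an error.
    local ok, summary = pcall(require, 'metrics.collectors.summary')
    if not ok or type(summary) ~= 'table' then
        return 'local'
    end

    -- Same probe as in `registry.is_supported`: per the doc comment there,
    -- age buckets need `metrics >= 0.9.0`.
    if summary.rotate_age_buckets == nil then
        return 'local'
    end

    return 'metrics'
end

-- The result depends on what is installed in the current environment.
print(pick_driver('metrics'))
```

Probing for a capability keeps the check independent of how the rock reports its version, at the cost of relying on an internal module path staying stable.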
