Dashboard menu (table of contents) redesign to be driven by OpenTelemetry standards #354

Open

Assignees: ktsaou
Labels: area/dashboards, cloud-backend, cloud-frontend
Milestone: [2023] Summer

ktsaou (Member) opened on Mar 18, 2022 · edited by ktsaou

The way our dashboard menu is currently built relies on chart names. This is hugely problematic, and for years I have considered it "my shame".

https://github.com/netdata/netdata/blob/561557c5ae2f427033fb3809e52e8db7cbecf7f7/web/gui/main.js#L1361-L1363

So, I have been discussing OpenTelemetry with @ilyam8, and we drew some diagrams on the wall.

I think we have a viable way to re-use the context that will bring netdata a lot closer to OpenTelemetry and will also let us come up with much better logic for creating the table of contents of the dashboards.

This logic may also fix #351 and #347, and will probably eliminate many more bugs waiting to be found.

To validate the above assumption I need the following export from the charts on the cloud:

chart_id	chart_name	context	family	plugin	module	units	count
-	-	-	-	-	-	-	the number of charts with this combination
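
As a rough illustration of the requested export, the sketch below counts how many charts share each combination of the columns above. It assumes the raw input has one row per chart; the sample rows are made up for the example and are not taken from the real export.

```python
import csv
import io
from collections import Counter

# Column names come from the requested export; "count" is computed here.
KEY_COLUMNS = ["chart_id", "chart_name", "context", "family", "plugin", "module", "units"]


def count_combinations(rows):
    """Count how many charts share each combination of the key columns."""
    counts = Counter()
    for row in rows:
        counts[tuple(row[col] for col in KEY_COLUMNS)] += 1
    return counts


# Made-up sample standing in for the real per-chart data (assumption).
SAMPLE = (
    "chart_id\tchart_name\tcontext\tfamily\tplugin\tmodule\tunits\n"
    "net.eth0\tnet.eth0\tnet.net\teth0\tproc.plugin\t/proc/net/dev\tkilobits/s\n"
    "net.eth0\tnet.eth0\tnet.net\teth0\tproc.plugin\t/proc/net/dev\tkilobits/s\n"
    "net.docker0\tnet.docker0\tnet.net\tdocker0\tproc.plugin\t/proc/net/dev\tkilobits/s\n"
)

reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="\t")
combination_counts = count_combinations(reader)

# Print one line per combination, most frequent first, with the count last.
for combo, count in combination_counts.most_common():
    print("\t".join(combo), count, sep="\t")
```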

@papazach is it possible to have this export?

ktsaou (Member, Author)

The OpenTelemetry Timeseries Model is defined like this:

In this low-level metrics data model, a Timeseries is defined by an entity consisting of several metadata properties:

Metric name

Attributes (dimensions)

Kind of point (integer, floating point, etc)

Unit of measurement

In netdata, the Metric Name is the context. Our Prometheus exporter already works this way: the metrics we send are the contexts, and our dimensions become their attributes/labels. Example:

# curl 2>/dev/null 'http://localhost:19999/api/v1/allmetrics?format=prometheus' | grep "^netdata_net_net_kilobits_persec_average"
netdata_net_net_kilobits_persec_average{chart="net.enp59s0u2u4",family="enp59s0u2u4",dimension="received"} 33.0627131 1647684792000
netdata_net_net_kilobits_persec_average{chart="net.enp59s0u2u4",family="enp59s0u2u4",dimension="sent"} -58.6417316 1647684792000
netdata_net_net_kilobits_persec_average{chart="net.docker0",family="docker0",dimension="received"} 1.5386666 1647684792000
netdata_net_net_kilobits_persec_average{chart="net.docker0",family="docker0",dimension="sent"} -2.5285036 1647684792000
netdata_net_net_kilobits_persec_average{chart="net.virbr0",family="virbr0",dimension="received"} 0.0332308 1647684792000
netdata_net_net_kilobits_persec_average{chart="net.virbr0",family="virbr0",dimension="sent"} -0.5177434 1647684792000
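
To make the mapping concrete, here is a small sketch that parses lines like the ones above and groups them by metric name, so that one context shows up as one metric with chart and dimension as attributes. The parser is deliberately simplified (it does not handle escaped quotes or comment lines) and is not the exporter's own code.

```python
import re
from collections import defaultdict

# One sample line: name{label="value",...} value timestamp
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)\s+(?P<ts>\d+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_line(line):
    """Split one exposition line into (metric name, labels, value, timestamp)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unsupported line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value")), int(match.group("ts"))


# Two of the lines from the example output above.
LINES = [
    'netdata_net_net_kilobits_persec_average{chart="net.docker0",family="docker0",dimension="received"} 1.5386666 1647684792000',
    'netdata_net_net_kilobits_persec_average{chart="net.docker0",family="docker0",dimension="sent"} -2.5285036 1647684792000',
]

# metric name (derived from the context) -> {(chart, dimension): value}
by_metric = defaultdict(dict)
for line in LINES:
    name, labels, value, _timestamp = parse_line(line)
    by_metric[name][(labels["chart"], labels["dimension"])] = value
```

Both lines land under the same metric name, which is the point: the context identifies the metric, and the chart and dimension only distinguish its time series.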

Changing the context

The context in netdata is easy to change. It is only used:

In charts, to uniquely identify the kind of chart. All charts of the same type have exactly the same context.
In alarm templates, to automatically attach alarms to charts.
In the Prometheus exporter.

Changing the context will have minimal impact on backwards compatibility:

Alarm templates created or edited by our users may need to be updated.
   This can probably be automated by rewriting all custom or edited alarms to use the new contexts.
Grafana dashboards that our users built on our Prometheus export may need to be updated.
   This cannot be automated, but we can supply a map of old and new contexts so that users can quickly adapt their systems.
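
A minimal sketch of how the automated alarm migration could work, assuming the old-to-new map exists as a simple dictionary. The new context names in `CONTEXT_MAP`, the alarm names, and the `migrate_template` helper are all hypothetical; only the `on:` line of each template is rewritten.

```python
import re

# Hypothetical old -> new context map; the new names are illustrative only.
CONTEXT_MAP = {
    "net.net": "system.network.io",
    "system.ram": "system.memory.usage",
}

# Matches the "on: <context>" line of an alarm template.
ON_LINE_RE = re.compile(r"^(?P<prefix>\s*on:\s*)(?P<context>\S+)(?P<suffix>\s*)$")


def migrate_template(text, context_map):
    """Rewrite 'on:' lines whose context appears in context_map; keep the rest as-is."""
    migrated_lines = []
    for line in text.splitlines():
        match = ON_LINE_RE.match(line)
        if match and match.group("context") in context_map:
            line = match.group("prefix") + context_map[match.group("context")] + match.group("suffix")
        migrated_lines.append(line)
    return "\n".join(migrated_lines)


# Made-up user alarm file with one mapped and one unmapped context.
ALARMS = """template: my_custom_alarm
      on: net.net
   every: 10s
template: another_alarm
      on: disk.io
"""

migrated = migrate_template(ALARMS, CONTEXT_MAP)
print(migrated)
```

Contexts that are missing from the map are left untouched, so a partial map cannot silently break an alarm that has no new name yet.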

How we should change the context

The OpenTelemetry Metric Semantic Conventions provide some guidance on how metrics should be named:

limit - an instrument that measures the constant, known total amount of something should be called entity.limit. For example, system.memory.limit for the total amount of memory on a system.
usage - an instrument that measures an amount used out of a known total (limit) amount should be called entity.usage. For example, system.memory.usage with attribute state = used | cached | free | ... for the amount of memory in each state. Where appropriate, the sum of usage over all attribute values SHOULD be equal to the limit.

A measure of the amount consumed of an unlimited resource, or of a resource whose limit is unknowable, is differentiated from usage. For example, the maximum possible amount of virtual memory that a process may consume may fluctuate over time and is not typically known.
utilization - an instrument that measures the fraction of usage out of its limit should be called entity.utilization. For example, system.memory.utilization for the fraction of memory in use. Utilization values are in the range [0, 1].
time - an instrument that measures passage of time should be called entity.time. For example, system.cpu.time with attribute state = idle | user | system | .... time measurements are not necessarily wall time and can be less than or greater than the real wall time between measurements.

time instruments are a special case of usage metrics, where the limit can usually be calculated as the sum of time over all attribute values. utilization for time instruments can be derived automatically using metric event timestamps. For example, system.cpu.utilization is defined as the difference in system.cpu.time measurements divided by the elapsed time.
io - an instrument that measures bidirectional data flow should be called entity.io and have attributes for direction. For example, system.network.io.
Other instruments that do not fit the above descriptions may be named more freely. For example, system.paging.faults and system.network.packets. Units do not need to be specified in the names since they are included during instrument creation, but can be added if there is ambiguity.
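
The relationships between limit, usage, utilization and time can be checked with a few lines of arithmetic. The sketch below uses made-up numbers and assumes a single logical CPU, so that the per-state time deltas add up to the elapsed time; the function names are illustrative.

```python
def utilization(usage, limit):
    """Fraction of a known limit that is in use, in the range [0, 1]."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return usage / limit


def cpu_utilization(time_before, time_after, elapsed_seconds):
    """Per-state utilization from two cumulative system.cpu.time samples."""
    return {
        state: (time_after[state] - time_before[state]) / elapsed_seconds
        for state in time_after
    }


# system.memory.usage by state, in GiB (made-up values).
memory_usage = {"used": 6.0, "cached": 1.5, "free": 0.5}
# The sum of usage over all states should equal system.memory.limit.
memory_limit = sum(memory_usage.values())
memory_utilization = utilization(memory_usage["used"], memory_limit)

# Two cumulative system.cpu.time samples taken 10 seconds apart (made-up values).
cpu_before = {"idle": 100.0, "user": 20.0, "system": 5.0}
cpu_after = {"idle": 107.0, "user": 22.0, "system": 6.0}
cpu_util = cpu_utilization(cpu_before, cpu_after, elapsed_seconds=10.0)
```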

The above is great, but it does not address another key question: how to name entities.

Netdata should have semantics for entities too

Naming entities is crucial to provide clarity in the monitoring platform. Take, for example, the system.memory.usage metric from the OpenTelemetry conventions above:

In Linux, system.memory.usage will have attributes/dimensions free, used, cache and buffers
In FreeBSD, system.memory.usage will have attributes/dimensions free, active, inactive, wired and buffers

Having non-uniform dimensions under the same metric will make aggregations impossible, wrong or error-prone.

We could have this:

system.memory.usage with variation: linux, the linux one
system.memory.usage with variation: freebsd, the freebsd one
system.memory.usage with variation: macos, the macos one
system.memory.usage with variation: windows, the windows one

A similar situation may arise when an application enriches its metrics in a later version. Let's assume that application X reports memory as used and cache, but from version 2 it breaks cache down into data cache and index cache. How can we notify users that two versions of the same metric exist, and how can we set different alarms for each version? A variation field could help with that.

In time, all kinds of variations may happen:

new metrics introduced
old metrics being obsoleted
even metrics changing meaning

Netdata alarm templates already do something similar to solve this problem, by filtering by os: https://github.com/netdata/netdata/blob/cabf89dfebb5441e2249760fde14afdb3739c91c/health/health.d/timex.conf#L7
This is, however, inefficient. What if we needed to apply a different alarm based on the MySQL version?

A more generic mechanism (covering whatever may change the attributes/dimensions, and therefore the meaning, of a metric) seems a lot better. So, we may say that mysql.memory.usage may exist with variation: mysql-1 and variation: mysql-2, and different alarm templates may match the first or the second.
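
One way to picture the variation field is as an extra, optional matching key on alarm templates. The sketch below is only an illustration of that idea: the `Metric` and `AlarmTemplate` types, the template names, and the rule that a missing variation matches everything are assumptions, not existing netdata behavior.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Metric:
    context: str
    variation: str
    dimensions: Tuple[str, ...]


@dataclass(frozen=True)
class AlarmTemplate:
    name: str
    context: str
    variation: Optional[str] = None  # assumption: None matches any variation

    def matches(self, metric):
        """A template applies when the context is equal and the variation is compatible."""
        return self.context == metric.context and self.variation in (None, metric.variation)


def templates_for(metric, templates):
    """Return the names of all templates that apply to the given metric."""
    return [template.name for template in templates if template.matches(metric)]


# The two variations of mysql.memory.usage from the example above.
mysql_v1 = Metric("mysql.memory.usage", "mysql-1", ("used", "cache"))
mysql_v2 = Metric("mysql.memory.usage", "mysql-2", ("used", "data cache", "index cache"))

# Hypothetical templates: one per variation, plus one that applies to both.
TEMPLATES = [
    AlarmTemplate("mysql_cache_low", "mysql.memory.usage", "mysql-1"),
    AlarmTemplate("mysql_index_cache_low", "mysql.memory.usage", "mysql-2"),
    AlarmTemplate("mysql_memory_used_high", "mysql.memory.usage"),
]

v1_alarms = templates_for(mysql_v1, TEMPLATES)
v2_alarms = templates_for(mysql_v2, TEMPLATES)
```

With this shape, the existing os filter becomes just one possible value of variation rather than a special case.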

ktsaou (Member, Author)

Using the context to build the table of contents of x-node dashboards

Cross-node, composite, overview dashboards are supposed to be the x-ray of the infrastructure. Today we build the table of contents and group charts together based on context plus many heuristics to make the result meaningful. But this is becoming increasingly complicated...

By changing the context for all charts and adding a scope to them, we may be in a position to simplify this logic tremendously, while giving users additional clarity about the infrastructure they run.

So, the menu could look like this (without taking into account the existing family field that groups charts into subcategories):

system
   + cpu
       + core
   + load
   + memory
   + storage
      + disks          per block device
      + mountpoints    per mount point
      + filesystems    per filesystem (BTRFS, EXT4, ZFS, NFS, etc)
   + network
      + ip
      + ipv4
      + ipv6
      + interfaces     per network interface
      + firewall
      + softnet
         + core
   + processes
   + idlejitter
   + interrupts
      + percpu
   + softirqs
      + core
   + uptime
process
   + systemd services
   + apps
   + users
   + groups
cgroups
sensors
power
   + ups
   + battery
mysql
nginx
weblog
etc...
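
A menu like the one above could be derived mechanically from dotted names, with no heuristics on chart names. The sketch below assumes each input is an entity path such as `system.cpu.core` (hypothetical names modeled on the tree above) and ignores the family field, as the tree does.

```python
def build_menu(entity_paths):
    """Build a nested dict of menu sections from dotted entity paths."""
    menu = {}
    for path in sorted(entity_paths):
        node = menu
        for part in path.split("."):
            # Create the section on first sight, then descend into it.
            node = node.setdefault(part, {})
    return menu


def render_menu(menu, depth=0):
    """Render the nested dict as indented lines in the style of the tree above."""
    lines = []
    for name, children in menu.items():
        prefix = "" if depth == 0 else "   " * depth + "+ "
        lines.append(prefix + name)
        lines.extend(render_menu(children, depth + 1))
    return lines


# Hypothetical entity paths modeled on the proposed menu.
ENTITY_PATHS = [
    "system.cpu.core",
    "system.load",
    "system.storage.disks",
    "power.ups",
    "mysql",
]

menu_lines = render_menu(build_menu(ENTITY_PATHS))
print("\n".join(menu_lines))
```

Sections are sorted alphabetically here for determinism; a real menu would likely need an explicit ordering so that system stays on top.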

papazach

Hello @ktsaou, I uploaded the requested export to our Google Drive in .csv format (404 MB gzipped, 5.4 GB original size).
You may find it here.

Feel free to take a look and reach out in case something additional is required.

hugovalente-pm (Contributor)

The extract above has been loaded into BigQuery so that it can be filtered down to an extract that fits in Google Sheets. The file is here.

netdata-community-bot

This issue has been mentioned on the Netdata Community Forums. There might be relevant details there:

https://community.netdata.cloud/t/group-docker-containers-in-menu/2797/4

hugovalente-pm mentioned this on Jun 9, 2022: [BUG] Composite charts should not mix plugins together #351

hugovalente-pm mentioned this on Jul 21, 2022: Breaking changes for Netdata Agent 2.0 netdata/netdata#13347

hugovalente-pm mentioned this on Jul 28, 2022: [Feat]: Pin or star a section to always show it on top #527

changed the title from "Dashboard menu (table of contents) could be built using the context" to "Dashboard menu (table of contents) redesign to be driven by OpenTelemetry standards" on May 5, 2023

hugovalente-pm (Contributor)

This depends on https://github.com/netdata/product/issues/3123

hugovalente-pm assigned ktsaou on May 12, 2023

Milestone: [2023] Summer (past due by Sep 1, 2023)

Activity

ktsaou commented on Mar 19, 2022: Changing the context; How we should change the context; Netdata should have semantics for entities too
ktsaou commented on Mar 19, 2022: Using the context to build the table of contents of x-node dashboards
papazach commented on Mar 22, 2022
hugovalente-pm commented on Apr 21, 2022
netdata-community-bot commented on May 11, 2022
hugovalente-pm commented on May 5, 2023
