Dashboard menu (table of contents) redesign to be driven by OpenTelemetry standards #354

@ktsaou
Member

The way our dashboard menu is currently built relies on chart names. This is hugely problematic, and for years I have considered it "my shame".

https://github.com/netdata/netdata/blob/561557c5ae2f427033fb3809e52e8db7cbecf7f7/web/gui/main.js#L1361-L1363

So, I have been discussing OpenTelemetry with @ilyam8 and we were doing some diagrams on the wall:

[Image: photo of the diagrams on the wall, 2022-03-18]

I think we have a viable way to re-use the context that will bring netdata a lot closer to OpenTelemetry and will also allow us to come up with much better logic for creating the table of contents of the dashboards.

This logic may also fix #351 and #347, and will probably eliminate many more bugs waiting to be found.

To validate the above assumption I need the following export from the charts on the cloud:

| chart_id | chart_name | context | family | plugin | module | units | count |
|---|---|---|---|---|---|---|---|
| - | - | - | - | - | - | - | the number of charts with this combination |
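
To make the requested shape concrete, here is a minimal sketch of how the `count` column could be computed from per-chart rows. It assumes each chart is available as a dict keyed by the column names above; the sample rows and the `count_combinations` helper are hypothetical, for illustration only.

```python
from collections import Counter

# Columns of the requested export; the count column is computed from them.
FIELDS = ("chart_id", "chart_name", "context", "family", "plugin", "module", "units")

def count_combinations(charts):
    """Count how many charts share each combination of FIELDS."""
    return Counter(tuple(chart.get(field, "") for field in FIELDS) for chart in charts)

# Hypothetical sample rows, not real cloud data.
eth0 = {"chart_id": "net.eth0", "chart_name": "net.eth0", "context": "net.net",
        "family": "eth0", "plugin": "proc.plugin", "module": "/proc/net/dev",
        "units": "kilobits/s"}
eth1 = dict(eth0, chart_id="net.eth1", chart_name="net.eth1", family="eth1")

# The same eth0 chart seen on two nodes, plus one eth1 chart.
counts = count_combinations([eth0, dict(eth0), eth1])
for combination, count in counts.items():
    print(*combination, count, sep="\t")
```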

@papazach is it possible to have this export?

ktsaou commented on Mar 19, 2022

@ktsaou
Member, Author

The OpenTelemetry Timeseries Model is defined like this:

In this low-level metrics data model, a Timeseries is defined by an entity consisting of several metadata properties:

  • Metric name
  • Attributes (dimensions)
  • Kind of point (integer, floating point, etc)
  • Unit of measurement

In netdata the Metric Name is the context. Our prometheus exporter already works this way: the metrics we send are actually the contexts, with our dimensions as attributes/labels. Example:

# curl 2>/dev/null 'http://localhost:19999/api/v1/allmetrics?format=prometheus' | grep "^netdata_net_net_kilobits_persec_average"
netdata_net_net_kilobits_persec_average{chart="net.enp59s0u2u4",family="enp59s0u2u4",dimension="received"} 33.0627131 1647684792000
netdata_net_net_kilobits_persec_average{chart="net.enp59s0u2u4",family="enp59s0u2u4",dimension="sent"} -58.6417316 1647684792000
netdata_net_net_kilobits_persec_average{chart="net.docker0",family="docker0",dimension="received"} 1.5386666 1647684792000
netdata_net_net_kilobits_persec_average{chart="net.docker0",family="docker0",dimension="sent"} -2.5285036 1647684792000
netdata_net_net_kilobits_persec_average{chart="net.virbr0",family="virbr0",dimension="received"} 0.0332308 1647684792000
netdata_net_net_kilobits_persec_average{chart="net.virbr0",family="virbr0",dimension="sent"} -0.5177434 1647684792000
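
To illustrate the mapping, here is a small sketch that splits one of the lines above into a metric name (the context, as flattened by the exporter), its labels (including the dimension) and the value. The regular expressions are a simplification that I am assuming is good enough for lines like these; they do not handle escaped quotes inside label values.

```python
import re

# Simplified pattern: name{label="value",...} value [timestamp]
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)(?:\s+(?P<ts>\d+))?$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_line(line):
    """Return (metric_name, labels, value) for one exposition line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))

sample = ('netdata_net_net_kilobits_persec_average'
          '{chart="net.docker0",family="docker0",dimension="sent"} '
          '-2.5285036 1647684792000')
name, labels, value = parse_line(sample)
print(name, labels["dimension"], value)
```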

Changing the context

The context in netdata is something easy to change. It is only used:

  1. In charts, to uniquely identify the kind of chart. All charts of the same type have exactly the same context.
  2. In alarm templates, to automatically attach alarms to charts.
  3. In prometheus exporter.

Changing the context will have minimal impact on backwards compatibility:

  1. alarm templates created or edited by our users may need to be updated
    This can probably be automated, by automatically changing all custom or edited alarms to the new contexts.

  2. grafana dashboards created by our users based on our prometheus export may need to be updated
    Although this cannot be automated, we can supply a map of old and new contexts so that our users can quickly adapt their systems.
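
A rough sketch of what the automated rewrite of alarm templates could look like. It assumes health configuration entities start with a `template:` or `alarm:` line and name their target on an `on:` line, and it only touches `on:` lines inside `template:` entities, since `alarm:` entities refer to a chart rather than a context. The `CONTEXT_MAP` entry, the sample configuration and the `migrate_contexts` helper are all hypothetical.

```python
import re

# Hypothetical old -> new context names, for illustration only.
CONTEXT_MAP = {"system.ram": "system.memory.usage"}

ENTITY_LINE = re.compile(r"^\s*(alarm|template):")
ON_LINE = re.compile(r"^(\s*on:\s*)(\S+)(\s*)$")

def migrate_contexts(config_text, context_map):
    """Rewrite the context on 'on:' lines that belong to template entities."""
    in_template = False
    out = []
    for line in config_text.splitlines():
        entity = ENTITY_LINE.match(line)
        if entity:
            in_template = entity.group(1) == "template"
        on = ON_LINE.match(line)
        if in_template and on and on.group(2) in context_map:
            line = on.group(1) + context_map[on.group(2)] + on.group(3)
        out.append(line)
    return "\n".join(out)

# Illustrative configuration, not copied from a real health.d file.
old_config = """\
template: ram_in_use
      on: system.ram
    calc: $used * 100 / ($used + $cached + $free)

   alarm: ram_in_use_local
      on: system.ram
"""
new_config = migrate_contexts(old_config, CONTEXT_MAP)
print(new_config)
```

The same map could be published as-is for users who need to update their grafana dashboards by hand.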

How we should change the context

The OpenTelemetry Metric Semantic Conventions provide some guidance on how metrics should be named:

  • limit - an instrument that measures the constant, known total amount of something should be called entity.limit. For example, system.memory.limit for the total amount of memory on a system.

  • usage - an instrument that measures an amount used out of a known total (limit) amount should be called entity.usage. For example, system.memory.usage with attribute state = used | cached | free | ... for the amount of memory in each state. Where appropriate, the sum of usage over all attribute values SHOULD be equal to the limit.

    A measure of the amount consumed of an unlimited resource, or of a resource whose limit is unknowable, is differentiated from usage. For example, the maximum possible amount of virtual memory that a process may consume may fluctuate over time and is not typically known.

  • utilization - an instrument that measures the fraction of usage out of its limit should be called entity.utilization. For example, system.memory.utilization for the fraction of memory in use. Utilization values are in the range [0, 1].

  • time - an instrument that measures passage of time should be called entity.time. For example, system.cpu.time with attribute state = idle | user | system | .... time measurements are not necessarily wall time and can be less than or greater than the real wall time between measurements.

    time instruments are a special case of usage metrics, where the limit can usually be calculated as the sum of time over all attribute values. utilization for time instruments can be derived automatically using metric event timestamps. For example, system.cpu.utilization is defined as the difference in system.cpu.time measurements divided by the elapsed time.

  • io - an instrument that measures bidirectional data flow should be called entity.io and have attributes for direction. For example, system.network.io.

  • Other instruments that do not fit the above descriptions may be named more freely. For example, system.paging.faults and system.network.packets. Units do not need to be specified in the names since they are included during instrument creation, but can be added if there is ambiguity.
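
The relationship between limit, usage, utilization and time can be sketched as follows. The numbers are made up, and the CPU example assumes a single CPU so that elapsed wall time is the limit; the function names are mine, not part of any OpenTelemetry API.

```python
def utilization(usage, limit):
    """Fraction of a known limit that is in use, in the range [0, 1]."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return usage / limit

def cpu_utilization(time_before, time_after, elapsed_seconds):
    """Per-state utilization: difference in cpu time divided by elapsed time."""
    return {
        state: (time_after[state] - time_before[state]) / elapsed_seconds
        for state in time_after
    }

# system.memory.usage by state (GiB, hypothetical); the limit is their sum.
memory_usage = {"used": 6.0, "cached": 1.0, "free": 1.0}
memory_limit = sum(memory_usage.values())
memory_utilization = utilization(memory_usage["used"], memory_limit)

# Two hypothetical system.cpu.time samples (seconds) taken 10 seconds apart.
before = {"user": 100.0, "system": 50.0, "idle": 850.0}
after = {"user": 102.0, "system": 51.0, "idle": 857.0}
cpu = cpu_utilization(before, after, elapsed_seconds=10.0)

print(memory_utilization, cpu)
```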

The above are great, but they do not address another key question: how to name entities.

Netdata should have semantics for entities too

Naming entities is crucial for clarity in the monitoring platform. Take the system.memory.usage example given in the OpenTelemetry definition above:

  • In Linux, system.memory.usage will have attributes/dimensions free, used, cache and buffers
  • In FreeBSD, system.memory.usage will have attributes/dimensions free, active, inactive, wired and buffers

Having non-uniform dimensions under the same metric will make aggregations impossible, wrong or error-prone.

We could have this:

  • system.memory.usage with variation: linux, the linux one
  • system.memory.usage with variation: freebsd, the freebsd one
  • system.memory.usage with variation: macos, the macos one
  • system.memory.usage with variation: windows, the windows one

A similar situation may arise when an application enriches its metrics in a later version. For example, assume application X reports memory as used and cache, but version 2 breaks cache down into data cache and index cache. How can we notify users that two versions of the same metric exist, and how can we set different alarms for each version? A variation field could help with that.

In time, all kinds of variations may happen:

  1. new metrics introduced
  2. old metrics being obsoleted
  3. metrics even changing meaning

To solve this problem Netdata alarm templates are already doing something similar, by filtering by os: https://github.com/netdata/netdata/blob/cabf89dfebb5441e2249760fde14afdb3739c91c/health/health.d/timex.conf#L7
This is however inefficient. What if we needed to apply a different alarm based on mysql version?

Having a more generic mechanism (keyed on whatever changes the attributes/dimensions, and therefore the meaning, of a metric) seems a lot better. So we may say that mysql.memory.usage may exist with variation: mysql-1 and variation: mysql-2, and different alarm templates may match the first or the second.
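
A minimal sketch of how alarm templates could match on context plus variation. Nothing like this exists in netdata today; the `Metric` and `AlarmTemplate` types, the dimension names and the template names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Metric:
    context: str
    variation: str
    dimensions: Tuple[str, ...]

@dataclass(frozen=True)
class AlarmTemplate:
    name: str
    context: str
    variation: Optional[str] = None  # None means "any variation"

    def matches(self, metric):
        """A template applies when the context matches and the variation fits."""
        return self.context == metric.context and self.variation in (None, metric.variation)

def templates_for(metric, templates):
    return [t.name for t in templates if t.matches(metric)]

templates = [
    AlarmTemplate("mysql_cache_v1", "mysql.memory.usage", "mysql-1"),
    AlarmTemplate("mysql_index_cache_v2", "mysql.memory.usage", "mysql-2"),
    AlarmTemplate("mysql_memory_any", "mysql.memory.usage"),
]

old_mysql = Metric("mysql.memory.usage", "mysql-1", ("used", "cache"))
new_mysql = Metric("mysql.memory.usage", "mysql-2", ("used", "data_cache", "index_cache"))

print(templates_for(old_mysql, templates))
print(templates_for(new_mysql, templates))
```

Filtering by os, as the timex alarm does today, would then be just one particular variation rather than a special case.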

ktsaou commented on Mar 19, 2022

@ktsaou
Member, Author

Using the context to build the table of contents of x-node dashboards

Cross-node, composite, overview dashboards are supposed to be the x-ray of the infrastructure. Today we build the table of contents and we group charts together based on context and many heuristics to make it meaningful. But this is becoming increasingly complicated...

By changing the context for all charts and adding a scope to them, we may be in a position to simplify this logic tremendously, while giving users additional clarity about the infrastructure they run.

So, the menu could look like this (without taking into account the existing family field that groups charts into subcategories):

system
   + cpu
       + core
   + load
   + memory
   + storage
      + disks          per block device
      + mountpoints    per mount point
      + filesystems    per filesystem (BTRFS, EXT4, ZFS, NFS, etc)
   + network
      + ip
      + ipv4
      + ipv6
      + interfaces     per network interface
      + firewall
      + softnet
         + core
   + processes
   + idlejitter
   + interrupts
      + percpu
   + softirqs
      + core
   + uptime
process
   + systemd services
   + apps
   + users
   + groups
cgroups
sensors
power
   + ups
   + battery
mysql
nginx
weblog
etc...
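
A menu like the one above could, in principle, be derived mechanically from dot-separated names. The sketch below assumes each chart carries a dotted path for its menu position; the sample paths and the `build_menu` and `render` helpers are hypothetical, and the existing family field is ignored.

```python
def build_menu(paths):
    """Nest dot-separated paths into a dict-of-dicts menu tree."""
    root = {}
    for path in sorted(paths):
        node = root
        for part in path.split("."):
            node = node.setdefault(part, {})
    return root

def render(tree, depth=0):
    """Render the tree as indented lines, one menu entry per line."""
    lines = []
    for name, children in tree.items():
        prefix = "   " * depth + ("+ " if depth else "")
        lines.append(prefix + name)
        lines.extend(render(children, depth + 1))
    return lines

# Hypothetical paths, loosely following the menu above.
paths = [
    "system.cpu.core",
    "system.load",
    "system.storage.disks",
    "system.storage.mountpoints",
    "power.ups",
]
menu = build_menu(paths)
print("\n".join(render(menu)))
```

The point is that the table of contents becomes a pure function of the names, with no per-chart heuristics.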

papazach commented on Mar 22, 2022

@papazach

Hello @ktsaou, I uploaded the requested export to our Google Drive in .csv format (404MB gzipped, 5.4GB original size).
You may find it here.

Feel free to take a look and reach out in case something additional is required.

hugovalente-pm commented on Apr 21, 2022

@hugovalente-pm
Contributor

The extract above has been loaded into BigQuery so that it could be filtered down to an extract that can be opened in Google Sheets. The file is here.

netdata-community-bot commented on May 11, 2022

@netdata-community-bot

This issue has been mentioned on the Netdata Community Forums. There might be relevant details there:

https://community.netdata.cloud/t/group-docker-containers-in-menu/2797/4

Title changed from "Dashboard menu (table of contents) could be built using the context" to "Dashboard menu (table of contents) redesign to be driven by OpenTelemetry standards" on May 5, 2023

hugovalente-pm commented on May 5, 2023

@hugovalente-pm
Contributor
