README.md (+24, -4)
@@ -59,6 +59,12 @@ List of supported problem daemons:
59
59
| [KernelMonitor](https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor.json) | KernelDeadlock | A system log monitor that monitors the kernel log and reports problems according to predefined rules. |
60
60
| [AbrtAdaptor](https://github.com/kubernetes/node-problem-detector/blob/master/config/abrt-adaptor.json) | None | Monitors ABRT log messages and reports them further. ABRT (Automatic Bug Report Tool) is a health monitoring daemon able to catch kernel problems as well as application crashes of various kinds that occur on the host. For more information visit the [link](https://github.com/abrt). |
61
61
| [CustomPluginMonitor](https://github.com/kubernetes/node-problem-detector/blob/master/config/custom-plugin-monitor.json) | On-demand (according to the user's configuration) | A custom plugin monitor for node-problem-detector that invokes user-defined check scripts to check for various node problems. See the proposal [here](https://docs.google.com/document/d/1jK_5YloSYtboj-DtfjmYKxfNnUxCAvohLnsH5aGCAYQ/edit#). |
62
+
| [SystemStatsMonitor](https://github.com/kubernetes/node-problem-detector/blob/master/config/system-stats-monitor.json) | None (could be added in the future) | A system stats monitor for node-problem-detector that collects various health-related system stats as metrics. See the proposal [here](https://docs.google.com/document/d/1SeaUz6kBavI283Dq8GBpoEUDrHA2a795xtw0OvjM568/edit). |
63
+
64
+
# Exporter
65
+
66
+
An exporter is a component of node-problem-detector. It reports node problems and/or metrics to
67
+
a certain backend (e.g. the Kubernetes API server or a Prometheus scrape endpoint).
62
68
63
69
# Usage
64
70
@@ -67,16 +73,21 @@ List of supported problem daemons:
67
73
* `--version`: Print the current version of node-problem-detector.
68
74
* `--address`: The address to bind the node problem detector server to.
69
75
* `--port`: The port to bind the node problem detector server to. Use 0 to disable.
70
-
* `--system-log-monitors`: List of paths to system log monitor configuration files, comma separated, e.g.
76
+
* `--config.system-log-monitor`: List of paths to system log monitor configuration files, comma separated, e.g.
flag of [Heapster](https://github.com/kubernetes/heapster).
82
93
For example, to run without auth, use the following config:
@@ -85,6 +96,14 @@ For example, to run without auth, use the following config:
85
96
```
86
97
Refer to the [heapster docs](https://github.com/kubernetes/heapster/blob/master/docs/source-configuration.md#kubernetes) for a complete list of available options.
87
98
* `--hostname-override`: A customized node name used by node-problem-detector to update conditions and emit events. node-problem-detector gets the node name first from `hostname-override`, then from the `NODE_NAME` environment variable, and finally falls back to `os.Hostname`.
99
+
* `--prometheus-address`: The address to bind the Prometheus scrape endpoint to. Defaults to `127.0.0.1`.
100
+
* `--prometheus-port`: The port to bind the Prometheus scrape endpoint to. Defaults to 20257. Use 0 to disable.
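
The node-name lookup order described for `--hostname-override` can be illustrated with a minimal sketch. This is not node-problem-detector's code (which is written in Go); the function name `resolve_node_name` is hypothetical, and `socket.gethostname()` stands in for Go's `os.Hostname`.

```python
import os
import socket


def resolve_node_name(hostname_override=""):
    """Return a node name using the precedence described above.

    Hypothetical helper for illustration only.
    """
    # 1. An explicit --hostname-override value wins.
    if hostname_override:
        return hostname_override
    # 2. Otherwise use the NODE_NAME environment variable, if set and non-empty.
    env_name = os.environ.get("NODE_NAME", "")
    if env_name:
        return env_name
    # 3. Finally fall back to the OS hostname (stand-in for Go's os.Hostname).
    return socket.gethostname()
```

For example, `resolve_node_name("node-a")` returns `"node-a"` regardless of the environment, while `resolve_node_name()` returns the value of `NODE_NAME` when that variable is set.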
101
+
102
+
### Deprecated Flags
103
+
104
+
* `--system-log-monitors`: List of paths to system log monitor config files, comma separated. This option is deprecated, replaced by `--config.system-log-monitor`, and will be removed. NPD will panic if both `--system-log-monitors` and `--config.system-log-monitor` are set.
105
+
106
+
* `--custom-plugin-monitors`: List of paths to custom plugin monitor config files, comma separated. This option is deprecated, replaced by `--config.custom-plugin-monitor`, and will be removed. NPD will panic if both `--custom-plugin-monitors` and `--config.custom-plugin-monitor` are set.
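
The rule that a deprecated flag and its replacement must not be set together can be sketched as follows. This is an illustration in Python, not NPD's actual Go flag handling: `parse_monitor_flags` is a hypothetical name, and a `ValueError` stands in for the panic.

```python
import argparse


def parse_monitor_flags(argv):
    """Parse the new and deprecated system log monitor flags.

    Hypothetical sketch: raises ValueError where NPD would panic.
    """
    parser = argparse.ArgumentParser(prog="npd-flag-sketch")
    parser.add_argument("--config.system-log-monitor", dest="new_paths", default="")
    parser.add_argument("--system-log-monitors", dest="old_paths", default="")
    args = parser.parse_args(argv)

    # Setting both the deprecated flag and its replacement is an error.
    if args.new_paths and args.old_paths:
        raise ValueError(
            "--system-log-monitors and --config.system-log-monitor "
            "must not both be set"
        )

    # Either flag carries a comma-separated list of config file paths.
    chosen = args.new_paths or args.old_paths
    return [path for path in chosen.split(",") if path]
```

The same check would apply to `--custom-plugin-monitors` and `--config.custom-plugin-monitor`.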
88
107
89
108
## Build Image
90
109
@@ -149,12 +168,13 @@ For example, to test [KernelMonitor](https://github.com/kubernetes/node-problem-
2. ```kubectl proxy --port=8080``` (makes a running cluster's API server available locally)
151
170
3. Update [KernelMonitor](https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor.json)'s ```logPath``` to your local kernel log directory. For example, on some Linux systems, it is ```/run/log/journal``` instead of ```/var/log/journal```.
152
-
3. ```./bin/node-problem-detector --logtostderr --apiserver-override=http://127.0.0.1:8080?inClusterConfig=false --system-log-monitors=config/kernel-monitor.json --port=20256``` (or point to any API server address:port)
171
+
3. ```./bin/node-problem-detector --logtostderr --apiserver-override=http://127.0.0.1:8080?inClusterConfig=false --config.system-log-monitor=config/kernel-monitor.json --config.system-stats-monitor=config/system-stats-monitor.json --port=20256 --prometheus-port=20257``` (or point to any API server address:port and Prometheus port)
153
172
4. ```sudo sh -c "echo 'kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING' >> /dev/kmsg"```
154
173
5. You can see a ```KernelOops``` event in the node-problem-detector log.
155
174
6. ```sudo sh -c "echo 'kernel: INFO: task docker:20744 blocked for more than 120 seconds.' >> /dev/kmsg"```
156
175
7. You can see a ```DockerHung``` event and condition in the node-problem-detector log.
157
176
8. You can see the ```DockerHung``` condition at [http://127.0.0.1:20256/conditions](http://127.0.0.1:20256/conditions).
177
+
9. You can see disk-related system metrics in Prometheus format at [http://127.0.0.1:20257/metrics](http://127.0.0.1:20257/metrics).
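
To inspect the scrape output from the last step programmatically, a small parser for the Prometheus text format can help. The following is a minimal sketch that handles only plain `name{labels} value` sample lines; `parse_metrics` is a hypothetical helper, and the metric name in the usage note is a made-up example, not necessarily what NPD exposes.

```python
import re

# Matches `name{labels} value` or `name value`; anything after the value is ignored.
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# Matches individual `key="value"` pairs inside the braces.
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from Prometheus text output.

    Simplified sketch: skips comment lines and does not unescape label values.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and `# HELP` / `# TYPE` comments.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Given the body of a scrape as a string (for instance, read with `urllib.request.urlopen("http://127.0.0.1:20257/metrics")`), a line such as `example_metric{device="sda"} 1234` would be parsed into the tuple `("example_metric", {"device": "sda"}, 1234.0)`.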
158
178
159
179
**Note**:
160
180
- You can see more rule examples under [test/kernel_log_generator/problems](https://github.com/kubernetes/node-problem-detector/tree/master/test/kernel_log_generator/problems).
*System Stats Monitor* is a problem daemon in node problem detector. It collects pre-defined health-related metrics from different system components. Each component may allow further detailed configurations.
4
+
5
+
Currently supported components are:
6
+
7
+
* disk
8
+
9
+
See example config file [here](https://github.com/kubernetes/node-problem-detector/blob/master/config/system-stats-monitor.json).
10
+
11
+
## Detailed Configuration Options
12
+
13
+
### Global Configurations
14
+
15
+
The data collection period can be specified globally in the config file; see `invokeInterval` in the [example](https://github.com/kubernetes/node-problem-detector/blob/master/config/system-stats-monitor.json).
16
+
17
+
### Disk
18
+
19
+
The following metrics are collected from the `disk` component:
20
+
21
+
* `disk/io_time`: [# of milliseconds spent doing I/Os on this device](https://www.kernel.org/doc/Documentation/iostats.txt)
22
+
* `disk/weighted_io`: [weighted # of milliseconds spent doing I/Os on this device](https://www.kernel.org/doc/Documentation/iostats.txt)
23
+
* `disk/avg_queue_len`: [average # of requests that were waiting in the queue or being serviced during the last `invokeInterval`](https://www.xaprb.com/blog/2010/01/09/how-linux-iostat-computes-its-results/)
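
As a worked illustration of how an average queue length can be derived from the weighted I/O time, the sketch below follows the iostat-style calculation from the linked article: the change in weighted I/O milliseconds divided by the elapsed milliseconds. Whether System Stats Monitor computes `disk/avg_queue_len` in exactly this way is an assumption, and the function name is hypothetical.

```python
def avg_queue_len(weighted_io_prev_ms, weighted_io_now_ms, interval_seconds):
    """Estimate the average queue length over one collection interval.

    Hypothetical helper: takes two readings of the weighted I/O time counter
    (in milliseconds) and the seconds elapsed between them.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # Weighted I/O time accumulates (requests in flight) x (milliseconds elapsed),
    # so dividing its increase by the elapsed milliseconds gives the average
    # number of requests queued or in service.
    delta_ms = weighted_io_now_ms - weighted_io_prev_ms
    return delta_ms / (interval_seconds * 1000.0)
```

For example, if the weighted I/O time grows from 10000 ms to 40000 ms over a 60-second interval, the estimate is 30000 / 60000 = 0.5 requests on average.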
24
+
25
+
By setting the `metricsConfigs` field and `displayName` field ([example](https://github.com/kubernetes/node-problem-detector/blob/master/config/system-stats-monitor.json)), you can specify the list of metrics to be collected and their display names on the Prometheus scrape endpoint. The name of the disk block device is reported in the `device` metric label.
26
+
27
+
And a few other options:
28
+
* `includeRootBlk`: When set to `true`, add all block devices that are [not a slave or holder device](http://man7.org/linux/man-pages/man8/lsblk.8.html) to the list of disks that System Stats Monitor collects metrics from. When set to `false`, do not modify the list of disks that System Stats Monitor collects metrics from.
29
+
* `includeAllAttachedBlk`: When set to `true`, add all currently attached block devices to the list of disks that System Stats Monitor collects metrics from. When set to `false`, do not modify the list of disks that System Stats Monitor collects metrics from.
30
+
* `lsblkTimeout`: System Stats Monitor uses [`lsblk`](http://man7.org/linux/man-pages/man8/lsblk.8.html) to retrieve block device information. This option sets the timeout for calling `lsblk` commands.
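
To show how the options above might fit together, here is a minimal sketch that assembles a config and prints it as JSON. The field names come from this document, but the nesting (a `disk` object next to a top-level `invokeInterval`) and the duration strings such as `"60s"` are assumptions; the linked example config file is the authoritative reference. `build_disk_config` is a hypothetical helper.

```python
import json


def build_disk_config(metric_names, invoke_interval="60s", lsblk_timeout="5s"):
    """Assemble a System Stats Monitor style config as a dict.

    Hypothetical helper: the layout and duration format are assumptions.
    """
    return {
        "disk": {
            # Each collected metric gets a display name; here it is left unchanged.
            "metricsConfigs": {
                name: {"displayName": name} for name in metric_names
            },
            "includeRootBlk": True,
            "includeAllAttachedBlk": True,
            "lsblkTimeout": lsblk_timeout,
        },
        "invokeInterval": invoke_interval,
    }


config = build_disk_config(
    ["disk/io_time", "disk/weighted_io", "disk/avg_queue_len"]
)
print(json.dumps(config, indent=2))
```

Building the config in code keeps the metric list in one place; the printed JSON can then be compared against the example file before use.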