Commit 3e373f3

Update READMEs

Author: Xuewei Zhang
1 parent 764ebfd, commit 3e373f3

2 files changed: +54 -4 lines changed

README.md

+24 -4
@@ -59,6 +59,12 @@ List of supported problem daemons:
 | [KernelMonitor](https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor.json) | KernelDeadlock | A system log monitor monitors kernel log and reports problem according to predefined rules. |
 | [AbrtAdaptor](https://github.com/kubernetes/node-problem-detector/blob/master/config/abrt-adaptor.json) | None | Monitor ABRT log messages and report them further. ABRT (Automatic Bug Report Tool) is health monitoring daemon able to catch kernel problems as well as application crashes of various kinds occurred on the host. For more information visit the [link](https://github.com/abrt). |
 | [CustomPluginMonitor](https://github.com/kubernetes/node-problem-detector/blob/master/config/custom-plugin-monitor.json) | On-demand(According to users configuration) | A custom plugin monitor for node-problem-detector to invoke and check various node problems with user defined check scripts. See proposal [here](https://docs.google.com/document/d/1jK_5YloSYtboj-DtfjmYKxfNnUxCAvohLnsH5aGCAYQ/edit#). |
+| [SystemStatsMonitor](https://github.com/kubernetes/node-problem-detector/blob/master/config/system-stats-monitor.json) | None(Could be added in the future) | A system stats monitor for node-problem-detector to collect various health-related system stats as metrics. See proposal [here](https://docs.google.com/document/d/1SeaUz6kBavI283Dq8GBpoEUDrHA2a795xtw0OvjM568/edit). |
+
+# Exporter
+
+An exporter is a component of node-problem-detector. It reports node problems and/or metrics to a
+certain back end (e.g. Kubernetes API server, or Prometheus scrape endpoint).
 
 # Usage
 
@@ -67,16 +73,21 @@ List of supported problem daemons:
 * `--version`: Print current version of node-problem-detector.
 * `--address`: The address to bind the node problem detector server.
 * `--port`: The port to bind the node problem detector server. Use 0 to disable.
-* `--system-log-monitors`: List of paths to system log monitor configuration files, comma separated, e.g.
+* `--config.system-log-monitor`: List of paths to system log monitor configuration files, comma separated, e.g.
 [config/kernel-monitor.json](https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor.json).
 Node problem detector will start a separate log monitor for each configuration. You can
 use different log monitors to monitor different system log.
-* `--custom-plugin-monitors`: List of paths to custom plugin monitor config files, comma separated, e.g.
+* `--config.custom-plugin-monitor`: List of paths to custom plugin monitor config files, comma separated, e.g.
 [config/custom-plugin-monitor.json](https://github.com/kubernetes/node-problem-detector/blob/master/config/custom-plugin-monitor.json).
 Node problem detector will start a separate custom plugin monitor for each configuration. You can
 use different custom plugin monitors to monitor different node problems.
+* `--config.system-stats-monitor`: List of paths to system stats monitor config files, comma separated, e.g.
+[config/system-stats-monitor.json](https://github.com/kubernetes/node-problem-detector/blob/master/config/system-stats-monitor.json).
+Node problem detector will start a separate system stats monitor for each configuration. You can
+use different system stats monitors to monitor different problem-related system stats.
+* `--enable-k8s-exporter`: Enables reporting to Kubernetes API server, default to `true`.
 * `--apiserver-override`: A URI parameter used to customize how node-problem-detector
-connects the apiserver. The format is same as the
+connects the apiserver. This is ignored if `--enable-k8s-exporter` is `false`. The format is same as the
 [`source`](https://github.com/kubernetes/heapster/blob/master/docs/source-configuration.md#kubernetes)
 flag of [Heapster](https://github.com/kubernetes/heapster).
 For example, to run without auth, use the following config:
@@ -85,6 +96,14 @@ For example, to run without auth, use the following config:
 ```
 Refer [heapster docs](https://github.com/kubernetes/heapster/blob/master/docs/source-configuration.md#kubernetes) for a complete list of available options.
 * `--hostname-override`: A customized node name used for node-problem-detector to update conditions and emit events. node-problem-detector gets node name first from `hostname-override`, then `NODE_NAME` environment variable and finally fall back to `os.Hostname`.
+* `--prometheus-address`: The address to bind the Prometheus scrape endpoint, default to `127.0.0.1`.
+* `--prometheus-port`: The port to bind the Prometheus scrape endpoint, default to 20257. Use 0 to disable.
+
+### Deprecated Flags
+
+* `--system-log-monitors`: List of paths to system log monitor config files, comma separated. This option is deprecated, replaced by `--config.system-log-monitor`, and will be removed. NPD will panic if both `--system-log-monitors` and `--config.system-log-monitor` are set.
+
+* `--custom-plugin-monitors`: List of paths to custom plugin monitor config files, comma separated. This option is deprecated, replaced by `--config.custom-plugin-monitor`, and will be removed. NPD will panic if both `--custom-plugin-monitors` and `--config.custom-plugin-monitor` are set.
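
The conflict rule for the deprecated flags can be illustrated with a small sketch. It is written in Python purely for illustration: the `resolve_monitor_paths` helper is hypothetical and is not NPD's actual code, which is written in Go and panics rather than raising an exception.

```python
def resolve_monitor_paths(deprecated_value, new_value, deprecated_flag, new_flag):
    """Return the list of config paths from whichever flag was set.

    Mirrors the documented rule that setting both a deprecated flag and its
    replacement is an error. Hypothetical helper, for illustration only.
    """
    if deprecated_value and new_value:
        raise ValueError(
            f"{deprecated_flag} is deprecated and cannot be combined with {new_flag}"
        )
    chosen = new_value or deprecated_value or ""
    # Both flags take a comma-separated list of paths.
    return [path for path in chosen.split(",") if path]


print(resolve_monitor_paths(
    "", "config/kernel-monitor.json",
    "--system-log-monitors", "--config.system-log-monitor",
))
```

The sketch treats an unset flag as an empty string; how NPD represents unset flags internally is not stated in this commit.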
 
 ## Build Image
 
@@ -149,12 +168,13 @@ For example, to test [KernelMonitor](https://github.com/kubernetes/node-problem-
 1. ```make``` (build node-problem-detector locally)
 2. ```kubectl proxy --port=8080``` (make a running cluster's API server available locally)
 3. Update [KernelMonitor](https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor.json)'s ```logPath``` to your local kernel log directory. For example, on some Linux systems, it is ```/run/log/journal``` instead of ```/var/log/journal```.
-3. ```./bin/node-problem-detector --logtostderr --apiserver-override=http://127.0.0.1:8080?inClusterConfig=false --system-log-monitors=config/kernel-monitor.json --port=20256``` (or point to any API server address:port)
+3. ```./bin/node-problem-detector --logtostderr --apiserver-override=http://127.0.0.1:8080?inClusterConfig=false --config.system-log-monitor=config/kernel-monitor.json --config.system-stats-monitor=config/system-stats-monitor.json --port=20256 --prometheus-port=20257``` (or point to any API server address:port and Prometheus port)
 4. ```sudo sh -c "echo 'kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING' >> /dev/kmsg"```
 5. You can see ```KernelOops``` event in the node-problem-detector log.
 6. ```sudo sh -c "echo 'kernel: INFO: task docker:20744 blocked for more than 120 seconds.' >> /dev/kmsg"```
 7. You can see ```DockerHung``` event and condition in the node-problem-detector log.
 8. You can see ```DockerHung``` condition at [http://127.0.0.1:20256/conditions](http://127.0.0.1:20256/conditions).
+9. You can see disk related system metrics in Prometheus format at [http://127.0.0.1:20257/metrics](http://127.0.0.1:20257/metrics).
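
To check the output of step 9 programmatically, a minimal sketch along these lines could parse the Prometheus text format. The sample text, the metric name `disk_io_time`, and the device names are assumptions for illustration; the names actually exposed depend on the configured display names and on the exporter.

```python
def parse_metrics(text):
    """Parse simple Prometheus text-format samples into (name, labels, value).

    Handles only `name{label="value",...} value` and `name value` lines
    without timestamps, and skips comment lines. It is not a complete
    exposition-format parser (e.g. commas inside label values break it).
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        name, labels = series, {}
        if "{" in series:
            name, _, rest = series.partition("{")
            for pair in rest.rstrip("}").split(","):
                if pair:
                    key, _, val = pair.partition("=")
                    labels[key] = val.strip('"')
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical scrape output; real metric names and devices will differ.
SAMPLE = """\
# TYPE disk_io_time counter
disk_io_time{device="sda"} 1200
disk_io_time{device="sdb"} 300
"""

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels.get("device"), value)
```

In practice the text would come from fetching `http://127.0.0.1:20257/metrics` while node-problem-detector is running; a hard-coded sample keeps the sketch self-contained.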
 
 **Note**:
 - You can see more rule examples under [test/kernel_log_generator/problems](https://github.com/kubernetes/node-problem-detector/tree/master/test/kernel_log_generator/problems).

pkg/systemstatsmonitor/README.md

+30
@@ -0,0 +1,30 @@
+# System Stats Monitor
+
+*System Stats Monitor* is a problem daemon in node problem detector. It collects pre-defined health-related metrics from different system components. Each component may allow further detailed configurations.
+
+Currently supported components are:
+
+* disk
+
+See example config file [here](https://github.com/kubernetes/node-problem-detector/blob/master/config/system-stats-monitor.json).
+
+## Detailed Configuration Options
+
+### Global Configurations
+
+Data collection period can be specified globally in the config file, see `invokeInterval` at the [example](https://github.com/kubernetes/node-problem-detector/blob/master/config/system-stats-monitor.json).
+
+### Disk
+
+Below metrics are collected from `disk` component:
+
+* `disk/io_time`: [# of milliseconds spent doing I/Os on this device](https://www.kernel.org/doc/Documentation/iostats.txt)
+* `disk/weighted_io`: [weighted # of milliseconds spent doing I/Os on this device](https://www.kernel.org/doc/Documentation/iostats.txt)
+* `disk/avg_queue_len`: [average # of requests that was waiting in queue or being serviced during the last `invokeInterval`](https://www.xaprb.com/blog/2010/01/09/how-linux-iostat-computes-its-results/)
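
Following the linked iostat write-up, the average queue length over an interval can be derived from two samples of the weighted I/O time counter. The sketch below is a worked example under stated assumptions: `weighted_io` is treated as a cumulative counter in milliseconds, the interval is given in seconds, and the function name is hypothetical. It is not necessarily how System Stats Monitor computes the metric.

```python
def avg_queue_len(prev_weighted_io_ms, curr_weighted_io_ms, interval_seconds):
    """Average number of requests queued or in service over the interval.

    Computed as the change in weighted I/O time divided by elapsed wall time,
    both in milliseconds (the avgqu-sz formula described for iostat).
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta_ms = curr_weighted_io_ms - prev_weighted_io_ms
    return delta_ms / (interval_seconds * 1000.0)


# Weighted I/O time grew by 30,000 ms during a 60 s interval.
print(avg_queue_len(10_000, 40_000, 60))  # prints 0.5
```

A value of 0.5 means that, on average, half a request was waiting or being serviced at any moment during the interval.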
+
+By setting the `metricsConfigs` field and `displayName` field ([example](https://github.com/kubernetes/node-problem-detector/blob/master/config/system-stats-monitor.json)), you can specify the list of metrics to be collected, and their display names on the Prometheus scraping endpoint. The name of the disk block device will be reported in the `device` metrics label.
+
+And a few other options:
+* `includeRootBlk`: When set to `true`, add all block devices that are [not a slave or holder device](http://man7.org/linux/man-pages/man8/lsblk.8.html) to the list of disks that System Stats Monitor collects metrics from. When set to `false`, do not modify the list of disks that System Stats Monitor collects metrics from.
+* `includeAllAttachedBlk`: When set to `true`, add all currently attached block devices to the list of disks that System Stats Monitor collects metrics from. When set to `false`, do not modify the list of disks that System Stats Monitor collects metrics from.
+* `lsblkTimeout`: System Stats Monitor uses [`lsblk`](http://man7.org/linux/man-pages/man8/lsblk.8.html) to retrieve block devices information. This option sets the timeout for calling `lsblk` commands.
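
The options above could be combined into a config file roughly as sketched below. The field names come from this README, but the nesting under a `disk` key and the duration values `"60s"` and `"5s"` are assumptions made for illustration; the linked example config is the authoritative reference.

```python
import json


def build_config():
    """Build a hypothetical System Stats Monitor config as a Python dict."""
    metric_names = ["disk/io_time", "disk/weighted_io", "disk/avg_queue_len"]
    return {
        "disk": {
            # Collect each metric and expose it under the same display name.
            "metricsConfigs": {
                name: {"displayName": name} for name in metric_names
            },
            "includeRootBlk": True,
            "includeAllAttachedBlk": True,
            "lsblkTimeout": "5s",  # assumed duration format
        },
        "invokeInterval": "60s",  # assumed duration format
    }


print(json.dumps(build_config(), indent=2))
```

The printed JSON could be saved to a file and passed to `--config.system-stats-monitor`, provided the layout matches what the monitor actually expects.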
