Commit 975dc71

Merge pull request #275 from xueweiz/exp
node-problem-detector: report disk queue length in Prometheus format
2 parents df2bc3d + cf66246 commit 975dc71

29 files changed: 1486 additions, 101 deletions

README.md

Lines changed: 24 additions & 4 deletions
@@ -59,6 +59,12 @@ List of supported problem daemons:
 | [KernelMonitor](https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor.json) | KernelDeadlock | A system log monitor monitors kernel log and reports problem according to predefined rules. |
 | [AbrtAdaptor](https://github.com/kubernetes/node-problem-detector/blob/master/config/abrt-adaptor.json) | None | Monitor ABRT log messages and report them further. ABRT (Automatic Bug Report Tool) is health monitoring daemon able to catch kernel problems as well as application crashes of various kinds occurred on the host. For more information visit the [link](https://github.com/abrt). |
 | [CustomPluginMonitor](https://github.com/kubernetes/node-problem-detector/blob/master/config/custom-plugin-monitor.json) | On-demand(According to users configuration) | A custom plugin monitor for node-problem-detector to invoke and check various node problems with user defined check scripts. See proposal [here](https://docs.google.com/document/d/1jK_5YloSYtboj-DtfjmYKxfNnUxCAvohLnsH5aGCAYQ/edit#). |
+| [SystemStatsMonitor](https://github.com/kubernetes/node-problem-detector/blob/master/config/system-stats-monitor.json) | None(Could be added in the future) | A system stats monitor for node-problem-detector to collect various health-related system stats as metrics. See proposal [here](https://docs.google.com/document/d/1SeaUz6kBavI283Dq8GBpoEUDrHA2a795xtw0OvjM568/edit). |
+
+# Exporter
+
+An exporter is a component of node-problem-detector. It reports node problems and/or metrics to
+certain back end (e.g. Kubernetes API server, or Prometheus scrape endpoint).
 
 # Usage
 
@@ -67,16 +73,21 @@ List of supported problem daemons:
 * `--version`: Print current version of node-problem-detector.
 * `--address`: The address to bind the node problem detector server.
 * `--port`: The port to bind the node problem detector server. Use 0 to disable.
-* `--system-log-monitors`: List of paths to system log monitor configuration files, comma separated, e.g.
+* `--config.system-log-monitor`: List of paths to system log monitor configuration files, comma separated, e.g.
 [config/kernel-monitor.json](https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor.json).
 Node problem detector will start a separate log monitor for each configuration. You can
 use different log monitors to monitor different system log.
-* `--custom-plugin-monitors`: List of paths to custom plugin monitor config files, comma separated, e.g.
+* `--config.custom-plugin-monitor`: List of paths to custom plugin monitor config files, comma separated, e.g.
 [config/custom-plugin-monitor.json](https://github.com/kubernetes/node-problem-detector/blob/master/config/custom-plugin-monitor.json).
 Node problem detector will start a separate custom plugin monitor for each configuration. You can
 use different custom plugin monitors to monitor different node problems.
+* `--config.system-stats-monitor`: List of paths to system stats monitor config files, comma separated, e.g.
+[config/system-stats-monitor.json](https://github.com/kubernetes/node-problem-detector/blob/master/config/system-stats-monitor.json).
+Node problem detector will start a separate system stats monitor for each configuration. You can
+use different system stats monitors to monitor different problem-related system stats.
+* `--enable-k8s-exporter`: Enables reporting to Kubernetes API server, default to `true`.
 * `--apiserver-override`: A URI parameter used to customize how node-problem-detector
-connects the apiserver. The format is same as the
+connects the apiserver. This is ignored if `--enable-k8s-exporter` is `false`. The format is same as the
 [`source`](https://github.com/kubernetes/heapster/blob/master/docs/source-configuration.md#kubernetes)
 flag of [Heapster](https://github.com/kubernetes/heapster).
 For example, to run without auth, use the following config:
@@ -85,6 +96,14 @@ For example, to run without auth, use the following config:
 ```
 Refer [heapster docs](https://github.com/kubernetes/heapster/blob/master/docs/source-configuration.md#kubernetes) for a complete list of available options.
 * `--hostname-override`: A customized node name used for node-problem-detector to update conditions and emit events. node-problem-detector gets node name first from `hostname-override`, then `NODE_NAME` environment variable and finally fall back to `os.Hostname`.
+* `--prometheus-address`: The address to bind the Prometheus scrape endpoint, default to `127.0.0.1`.
+* `--prometheus-port`: The port to bind the Prometheus scrape endpoint, default to 20257. Use 0 to disable.
+
+### Deprecated Flags
+
+* `--system-log-monitors`: List of paths to system log monitor config files, comma separated. This option is deprecated, replaced by `--config.system-log-monitor`, and will be removed. NPD will panic if both `--system-log-monitors` and `--config.system-log-monitor` are set.
+
+* `--custom-plugin-monitors`: List of paths to custom plugin monitor config files, comma separated. This option is deprecated, replaced by `--config.custom-plugin-monitor`, and will be removed. NPD will panic if both `--custom-plugin-monitors` and `--config.custom-plugin-monitor` are set.
 
 ## Build Image
 
@@ -149,12 +168,13 @@ For example, to test [KernelMonitor](https://github.com/kubernetes/node-problem-
 1. ```make``` (build node-problem-detector locally)
 2. ```kubectl proxy --port=8080``` (make a running cluster's API server available locally)
 3. Update [KernelMonitor](https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor.json)'s ```logPath``` to your local kernel log directory. For example, on some Linux systems, it is ```/run/log/journal``` instead of ```/var/log/journal```.
-3. ```./bin/node-problem-detector --logtostderr --apiserver-override=http://127.0.0.1:8080?inClusterConfig=false --system-log-monitors=config/kernel-monitor.json --port=20256``` (or point to any API server address:port)
+3. ```./bin/node-problem-detector --logtostderr --apiserver-override=http://127.0.0.1:8080?inClusterConfig=false --config.system-log-monitor=config/kernel-monitor.json --config.system-stats-monitor=config/system-stats-monitor.json --port=20256 --prometheus-port=20257``` (or point to any API server address:port and Prometheus port)
 4. ```sudo sh -c "echo 'kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING' >> /dev/kmsg"```
 5. You can see ```KernelOops``` event in the node-problem-detector log.
 6. ```sudo sh -c "echo 'kernel: INFO: task docker:20744 blocked for more than 120 seconds.' >> /dev/kmsg"```
 7. You can see ```DockerHung``` event and condition in the node-problem-detector log.
 8. You can see ```DockerHung``` condition at [http://127.0.0.1:20256/conditions](http://127.0.0.1:20256/conditions).
+9. You can see disk related system metrics in Prometheus format at [http://127.0.0.1:20257/metrics](http://127.0.0.1:20257/metrics).
 
 **Note**:
 - You can see more rule examples under [test/kernel_log_generator/problems](https://github.com/kubernetes/node-problem-detector/tree/master/test/kernel_log_generator/problems).

cmd/node_problem_detector.go

Lines changed: 21 additions & 48 deletions
@@ -17,43 +17,20 @@ limitations under the License.
 package main
 
 import (
-	"net"
-	"net/http"
-	_ "net/http/pprof"
 	"os"
-	"strconv"
 
 	"github.com/golang/glog"
 	"github.com/spf13/pflag"
 
 	"k8s.io/node-problem-detector/cmd/options"
-	"k8s.io/node-problem-detector/pkg/custompluginmonitor"
-	"k8s.io/node-problem-detector/pkg/problemclient"
+	"k8s.io/node-problem-detector/pkg/exporters/k8sexporter"
+	"k8s.io/node-problem-detector/pkg/exporters/prometheusexporter"
+	"k8s.io/node-problem-detector/pkg/problemdaemon"
 	"k8s.io/node-problem-detector/pkg/problemdetector"
-	"k8s.io/node-problem-detector/pkg/systemlogmonitor"
 	"k8s.io/node-problem-detector/pkg/types"
 	"k8s.io/node-problem-detector/pkg/version"
 )
 
-func startHTTPServer(p problemdetector.ProblemDetector, npdo *options.NodeProblemDetectorOptions) {
-	// Add healthz http request handler. Always return ok now, add more health check
-	// logic in the future.
-	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
-		w.WriteHeader(http.StatusOK)
-		w.Write([]byte("ok"))
-	})
-	// Add the http handlers in problem detector.
-	p.RegisterHTTPHandlers()
-
-	addr := net.JoinHostPort(npdo.ServerAddress, strconv.Itoa(npdo.ServerPort))
-	go func() {
-		err := http.ListenAndServe(addr, nil)
-		if err != nil {
-			glog.Fatalf("Failed to start server: %v", err)
-		}
-	}()
-}
-
 func main() {
 	npdo := options.NewNodeProblemDetectorOptions()
 	npdo.AddFlags(pflag.CommandLine)
@@ -66,35 +43,31 @@ func main() {
 	}
 
 	npdo.SetNodeNameOrDie()
-
+	npdo.SetConfigFromDeprecatedOptionsOrDie()
 	npdo.ValidOrDie()
 
-	monitors := make(map[string]types.Monitor)
-	for _, config := range npdo.SystemLogMonitorConfigPaths {
-		if _, ok := monitors[config]; ok {
-			// Skip the config if it's duplicated.
-			glog.Warningf("Duplicated monitor configuration %q", config)
-			continue
-		}
-		monitors[config] = systemlogmonitor.NewLogMonitorOrDie(config)
+	// Initialize problem daemons.
+	problemDaemons := problemdaemon.NewProblemDaemons(npdo.MonitorConfigPaths)
+	if len(problemDaemons) == 0 {
+		glog.Fatalf("No problem daemon is configured")
 	}
 
-	for _, config := range npdo.CustomPluginMonitorConfigPaths {
-		if _, ok := monitors[config]; ok {
-			// Skip the config if it's duplicated.
-			glog.Warningf("Duplicated monitor configuration %q", config)
-			continue
-		}
-		monitors[config] = custompluginmonitor.NewCustomPluginMonitorOrDie(config)
+	// Initialize exporters.
+	exporters := []types.Exporter{}
+	if ke := k8sexporter.NewExporterOrDie(npdo); ke != nil {
+		exporters = append(exporters, ke)
+		glog.Info("K8s exporter started.")
 	}
-	c := problemclient.NewClientOrDie(npdo)
-	p := problemdetector.NewProblemDetector(monitors, c)
-
-	// Start http server.
-	if npdo.ServerPort > 0 {
-		startHTTPServer(p, npdo)
+	if pe := prometheusexporter.NewExporterOrDie(npdo); pe != nil {
+		exporters = append(exporters, pe)
+		glog.Info("Prometheus exporter started.")
+	}
+	if len(exporters) == 0 {
+		glog.Fatalf("No exporter is successfully setup")
 	}
 
+	// Initialize NPD core.
+	p := problemdetector.NewProblemDetector(problemDaemons, exporters)
 	if err := p.Run(); err != nil {
 		glog.Fatalf("Problem detector failed with error: %v", err)
 	}
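
The rewritten `main` builds a list of problem daemons and a list of exporters, refuses to start if either list is empty, and hands both to the problem detector. The sketch below illustrates only that fan-out shape. `Status`, `Monitor`, `Exporter`, `staticMonitor`, `recordingExporter`, and `run` are simplified, hypothetical stand-ins; they are not the definitions in `pkg/types` or `pkg/problemdetector`, whose real signatures are not shown in this diff.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Status is a simplified stand-in for what a problem daemon reports.
type Status struct {
	Source  string
	Message string
}

// Monitor and Exporter are hypothetical, pared-down interfaces for this sketch.
type Monitor interface {
	Start() <-chan Status
}

type Exporter interface {
	ExportProblems(Status)
}

// staticMonitor emits a fixed list of statuses and then closes its channel.
type staticMonitor struct{ statuses []Status }

func (m staticMonitor) Start() <-chan Status {
	ch := make(chan Status)
	go func() {
		defer close(ch)
		for _, s := range m.statuses {
			ch <- s
		}
	}()
	return ch
}

// recordingExporter stores every status it is handed.
type recordingExporter struct {
	mu   sync.Mutex
	seen []Status
}

func (e *recordingExporter) ExportProblems(s Status) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.seen = append(e.seen, s)
}

// run merges the status channels of all monitors and passes each status to
// every exporter. It returns once all monitors have closed their channels.
func run(monitors []Monitor, exporters []Exporter) error {
	if len(monitors) == 0 {
		return errors.New("no problem daemon is configured")
	}
	if len(exporters) == 0 {
		return errors.New("no exporter is configured")
	}
	merged := make(chan Status)
	var wg sync.WaitGroup
	for _, m := range monitors {
		ch := m.Start()
		wg.Add(1)
		go func(ch <-chan Status) {
			defer wg.Done()
			for s := range ch {
				merged <- s
			}
		}(ch)
	}
	go func() {
		wg.Wait()
		close(merged)
	}()
	for s := range merged {
		for _, e := range exporters {
			e.ExportProblems(s)
		}
	}
	return nil
}

func main() {
	rec := &recordingExporter{}
	monitors := []Monitor{
		staticMonitor{statuses: []Status{{Source: "kernel-monitor", Message: "KernelOops"}}},
		staticMonitor{statuses: []Status{{Source: "system-stats-monitor", Message: "disk stats"}}},
	}
	if err := run(monitors, []Exporter{rec}); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("exported statuses:", len(rec.seen))
}
```

Unlike this sketch, the real detector runs long-lived monitors and terminates the process with `glog.Fatalf` on configuration errors rather than returning an error.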

cmd/options/options.go

Lines changed: 103 additions & 13 deletions
@@ -24,20 +24,17 @@ import (
 	"net/url"
 
 	"github.com/spf13/pflag"
+
+	"k8s.io/node-problem-detector/pkg/custompluginmonitor"
+	"k8s.io/node-problem-detector/pkg/problemdaemon"
+	"k8s.io/node-problem-detector/pkg/systemlogmonitor"
+	"k8s.io/node-problem-detector/pkg/types"
 )
 
 // NodeProblemDetectorOptions contains node problem detector command line and application options.
 type NodeProblemDetectorOptions struct {
 	// command line options
 
-	// SystemLogMonitorConfigPaths specifies the list of paths to system log monitor configuration
-	// files.
-	SystemLogMonitorConfigPaths []string
-	// CustomPluginMonitorConfigPaths specifies the list of paths to custom plugin monitor configuration
-	// files.
-	CustomPluginMonitorConfigPaths []string
-	// ApiServerOverride is the custom URI used to connect to Kubernetes ApiServer.
-	ApiServerOverride string
 	// PrintVersion is the flag determining whether version information is printed.
 	PrintVersion bool
 	// HostnameOverride specifies custom node name used to override hostname.
@@ -47,41 +44,134 @@ type NodeProblemDetectorOptions struct {
 	// ServerAddress is the address to bind the node problem detector server.
 	ServerAddress string
 
+	// exporter options
+
+	// k8sExporter options
+	// EnableK8sExporter is the flag determining whether to report to Kubernetes.
+	EnableK8sExporter bool
+	// ApiServerOverride is the custom URI used to connect to Kubernetes ApiServer.
+	ApiServerOverride string
+
+	// prometheusExporter options
+	// PrometheusServerPort is the port to bind the Prometheus scrape endpoint. Use 0 to disable.
+	PrometheusServerPort int
+	// PrometheusServerAddress is the address to bind the Prometheus scrape endpoint.
+	PrometheusServerAddress string
+
+	// problem daemon options
+
+	// SystemLogMonitorConfigPaths specifies the list of paths to system log monitor configuration
+	// files.
+	// SystemLogMonitorConfigPaths is used by the deprecated option --system-log-monitors. The new
+	// option --config.system-log-monitor will stored the config file paths in MonitorConfigPaths.
+	SystemLogMonitorConfigPaths []string
+	// CustomPluginMonitorConfigPaths specifies the list of paths to custom plugin monitor configuration
+	// files.
+	// CustomPluginMonitorConfigPaths is used by the deprecated option --custom-plugin-monitors. The
+	// new option --config.custom-plugin-monitor will stored the config file paths in MonitorConfigPaths.
+	CustomPluginMonitorConfigPaths []string
+	// MonitorConfigPaths specifies the list of paths to configuration files for each monitor.
+	MonitorConfigPaths types.ProblemDaemonConfigPathMap
+
 	// application options
 
 	// NodeName is the node name used to communicate with Kubernetes ApiServer.
 	NodeName string
 }
 
 func NewNodeProblemDetectorOptions() *NodeProblemDetectorOptions {
-	return &NodeProblemDetectorOptions{}
+	return &NodeProblemDetectorOptions{MonitorConfigPaths: types.ProblemDaemonConfigPathMap{}}
 }
 
 // AddFlags adds node problem detector command line options to pflag.
 func (npdo *NodeProblemDetectorOptions) AddFlags(fs *pflag.FlagSet) {
 	fs.StringSliceVar(&npdo.SystemLogMonitorConfigPaths, "system-log-monitors",
 		[]string{}, "List of paths to system log monitor config files, comma separated.")
+	fs.MarkDeprecated("system-log-monitors", "replaced by --config.system-log-monitor. NPD will panic if both --system-log-monitors and --config.system-log-monitor are set.")
 	fs.StringSliceVar(&npdo.CustomPluginMonitorConfigPaths, "custom-plugin-monitors",
 		[]string{}, "List of paths to custom plugin monitor config files, comma separated.")
+	fs.MarkDeprecated("custom-plugin-monitors", "replaced by --config.custom-plugin-monitor. NPD will panic if both --custom-plugin-monitors and --config.custom-plugin-monitor are set.")
+	fs.BoolVar(&npdo.EnableK8sExporter, "enable-k8s-exporter", true, "Enables reporting to Kubernetes API server.")
 	fs.StringVar(&npdo.ApiServerOverride, "apiserver-override",
-		"", "Custom URI used to connect to Kubernetes ApiServer")
+		"", "Custom URI used to connect to Kubernetes ApiServer. This is ignored if --enable-k8s-exporter is false.")
 	fs.BoolVar(&npdo.PrintVersion, "version", false, "Print version information and quit")
 	fs.StringVar(&npdo.HostnameOverride, "hostname-override",
 		"", "Custom node name used to override hostname")
 	fs.IntVar(&npdo.ServerPort, "port",
 		20256, "The port to bind the node problem detector server. Use 0 to disable.")
 	fs.StringVar(&npdo.ServerAddress, "address",
 		"127.0.0.1", "The address to bind the node problem detector server.")
+
+	fs.IntVar(&npdo.PrometheusServerPort, "prometheus-port",
+		20257, "The port to bind the Prometheus scrape endpoint. Prometheus exporter is enabled by default at port 20257. Use 0 to disable.")
+	fs.StringVar(&npdo.PrometheusServerAddress, "prometheus-address",
+		"127.0.0.1", "The address to bind the Prometheus scrape endpoint.")
+
+	for _, problemDaemonName := range problemdaemon.GetProblemDaemonNames() {
+		npdo.MonitorConfigPaths[problemDaemonName] = &[]string{}
+		fs.StringSliceVar(
+			npdo.MonitorConfigPaths[problemDaemonName],
+			"config."+string(problemDaemonName),
+			[]string{},
+			fmt.Sprintf("Comma separated configurations for %v monitor. %v",
+				problemDaemonName,
+				problemdaemon.GetProblemDaemonHandlerOrDie(problemDaemonName).CmdOptionDescription))
+	}
 }
 
 // ValidOrDie validates node problem detector command line options.
 func (npdo *NodeProblemDetectorOptions) ValidOrDie() {
-	if _, err := url.Parse(npdo.ApiServerOverride); err != nil {
+	if _, err := url.Parse(npdo.ApiServerOverride); npdo.EnableK8sExporter && err != nil {
 		panic(fmt.Sprintf("apiserver-override %q is not a valid HTTP URI: %v",
 			npdo.ApiServerOverride, err))
 	}
-	if len(npdo.SystemLogMonitorConfigPaths) == 0 && len(npdo.CustomPluginMonitorConfigPaths) == 0 {
-		panic(fmt.Sprintf("Either --system-log-monitors or --custom-plugin-monitors is required"))
+
+	if len(npdo.SystemLogMonitorConfigPaths) != 0 {
+		panic("SystemLogMonitorConfigPaths is deprecated. It should have been reassigned to MonitorConfigPaths. This should not happen.")
+	}
+	if len(npdo.CustomPluginMonitorConfigPaths) != 0 {
+		panic("CustomPluginMonitorConfigPaths is deprecated. It should have been reassigned to MonitorConfigPaths. This should not happen.")
+	}
+
+	configCount := 0
+	for _, problemDaemonConfigPaths := range npdo.MonitorConfigPaths {
+		configCount += len(*problemDaemonConfigPaths)
+	}
+	if configCount == 0 {
+		panic("No configuration option for any problem daemon is specified.")
+	}
+}
+
+// SetConfigFromDeprecatedOptionsOrDie sets NPD option using deprecated options.
+func (npdo *NodeProblemDetectorOptions) SetConfigFromDeprecatedOptionsOrDie() {
+	if len(npdo.SystemLogMonitorConfigPaths) != 0 {
+		if npdo.MonitorConfigPaths[systemlogmonitor.SystemLogMonitorName] == nil {
+			npdo.MonitorConfigPaths[systemlogmonitor.SystemLogMonitorName] = &[]string{}
+		}
+
+		if len(*npdo.MonitorConfigPaths[systemlogmonitor.SystemLogMonitorName]) != 0 {
+			panic("Option --system-log-monitors is deprecated in favor of --config.system-log-monitor. They cannot be set at the same time.")
+		}
+
+		*npdo.MonitorConfigPaths[systemlogmonitor.SystemLogMonitorName] = append(
+			*npdo.MonitorConfigPaths[systemlogmonitor.SystemLogMonitorName],
+			npdo.SystemLogMonitorConfigPaths...)
+		npdo.SystemLogMonitorConfigPaths = []string{}
+	}
+
+	if len(npdo.CustomPluginMonitorConfigPaths) != 0 {
+		if npdo.MonitorConfigPaths[custompluginmonitor.CustomPluginMonitorName] == nil {
+			npdo.MonitorConfigPaths[custompluginmonitor.CustomPluginMonitorName] = &[]string{}
+		}
+
+		if len(*npdo.MonitorConfigPaths[custompluginmonitor.CustomPluginMonitorName]) != 0 {
+			panic("Option --custom-plugin-monitors is deprecated in favor of --config.custom-plugin-monitor. They cannot be set at the same time.")
+		}
+
+		*npdo.MonitorConfigPaths[custompluginmonitor.CustomPluginMonitorName] = append(
+			*npdo.MonitorConfigPaths[custompluginmonitor.CustomPluginMonitorName],
+			npdo.CustomPluginMonitorConfigPaths...)
+		npdo.CustomPluginMonitorConfigPaths = []string{}
 	}
 }
 
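
`SetConfigFromDeprecatedOptionsOrDie` repeats the same three steps for each deprecated flag: do nothing if the old flag is unset, fail if the replacement flag is also set, otherwise move the paths across and clear the old slice. A minimal sketch of that pattern as one function is shown here. The name `migrateDeprecated` and its parameters are hypothetical, and it returns an error where the commit's code panics.

```go
package main

import "fmt"

// migrateDeprecated is a hypothetical helper, not part of this commit.
// It moves paths supplied through a deprecated flag into the slice that
// backs the replacement flag, and rejects the case where both were set.
func migrateDeprecated(current, deprecated *[]string, oldFlag, newFlag string) error {
	if len(*deprecated) == 0 {
		return nil // deprecated flag unused, nothing to migrate
	}
	if len(*current) != 0 {
		return fmt.Errorf("option --%s is deprecated in favor of --%s; they cannot be set at the same time",
			oldFlag, newFlag)
	}
	*current = append(*current, *deprecated...)
	*deprecated = []string{} // cleared so later validation sees it as migrated
	return nil
}

func main() {
	current := []string{}
	deprecated := []string{"config/kernel-monitor.json"}
	err := migrateDeprecated(&current, &deprecated, "system-log-monitors", "config.system-log-monitor")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("migrated paths:", current, "remaining deprecated:", len(deprecated))
}
```

Clearing the deprecated slice mirrors the commit's approach, where `ValidOrDie` later treats a non-empty deprecated slice as a programming error.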