-    summary: Server accumulated too much "event loop lag". The webapp will become unresponsive if we don't act here.
+    summary: Server accumulated too much "event loop lag" on {{ $labels.cluster }}. The webapp will become unresponsive if we don't act here.
     description: Server has accumulated {{ printf "%.2f" $value }}s event loop lag.

 - alert: InstanceStartFailures
@@ -48,15 +48,15 @@ spec:
 # Rollout alerts
 - alert: JsonRpcApiErrorRates
   # Reasoning: the values are taken from past data
-  expr: sum (rate(gitpod_server_api_calls_total{statusCode!~"2..|429"}[5m])) / sum(rate(gitpod_server_api_calls_total[5m])) > 0.04
+  expr: sum (rate(gitpod_server_api_calls_total{statusCode!~"2..|429"}[5m])) by (cluster) / sum(rate(gitpod_server_api_calls_total[5m])) by (cluster) > 0.04
   for: 5m
   labels:
     # sent to the team internal channel until we fine tuned it
-    summary: The messagebus pod is not running. Workspace information is not being correctly propagated into web app clusters. Investigation required.
+    summary: The messagebus pod is not running in {{ $labels.cluster }}. Workspace information is not being correctly propagated into web app clusters. Investigation required.
     description: Messagebus pod not running

 - alert: WebAppServicesHighCPUUsage
   # Reasoning: high rates of CPU consumption should only be temporary.
-  expr: sum(rate(container_cpu_usage_seconds_total{container!="POD", node=~".*", pod=~"(content-service|dashboard|db|db-sync|messagebus|payment-endpoint|proxy|server|ws-manager-bridge|usage)-.*"}[5m])) by (pod, node) > 0.80
+  expr: sum(rate(container_cpu_usage_seconds_total{container!="POD", node=~".*", pod=~"(content-service|dashboard|db|db-sync|messagebus|payment-endpoint|proxy|server|ws-manager-bridge|usage)-.*"}[5m])) by (pod, node, cluster) > 0.80
   for: 10m
   labels:
     # sent to the team internal channel until we fine tuned it
@@ -114,13 +114,13 @@ spec:

 - alert: WebAppServicesCrashlooping
   # Reasoning: alert if any pod is restarting more than 3 times / 5 minutes.