After deploying Cortex, the ingesters took some time to stabilize. Mainly, cortex-ingester-0 was restarted several times before it became stable. After a few hours everything was working fine, but I noticed that ingester-0 was piling up the collected data for one tenant, because its disk utilization was going up considerably. This issue was not present on the other ingesters. Below is the disk utilization info for ingester-0:
I could also see that some of this tenant's metrics had no data when querying from the Grafana end. When I did a manual flush using the ingester-0 UI, I saw the errors below:
level=warn ts=2021-04-26T05:26:27.339450005Z caller=ingester_v2.go:1219 msg="TSDB blocks compaction for user has failed" user=edpub-nonprod err="reload blocks: head truncate failed: truncate chunks.HeadReadWriter: maxt of the files are not set" compactReason=forced
level=warn ts=2021-04-26T05:25:30.396896504Z caller=grpc_logging.go:38 method=/cortex.Ingester/Push duration=579.012µs err="rpc error: code = Code(400) desc = user=edpub-nonprod: series={__name__=\"container_tasks_state\", container=\"mobile-analytics-producer-ck\", endpoint=\"https-metrics\", id=\"/kubepods/burstable/pod5e47b02d-6932-4242-8268-94d6ba77b292/b89f0368edf422c75c31322b3026478478f282ada1e06734193eceddcc498907\", image=\"715824223074.dkr.ecr.us-east-1.amazonaws.com/mobile-analytics-producer-ck@sha256:7341caf25a488172718bd7cd10c4f8d3a805006113942424141969f1844c8e7b\", instance=\"10.50.169.155:10250\", job=\"kubelet\", metrics_path=\"/metrics/cadvisor\", name=\"k8s_mobile-analytics-producer-ck_mobile-analytics-producer-ck-6b4b77d5c8-6fgrb_mobiledev_5e47b02d-6932-4242-8268-94d6ba77b292_0\", namespace=\"mobiledev\", node=\"ip-10-50-169-155.ec2.internal\", pod=\"mobile-analytics-producer-ck-6b4b77d5c8-6fgrb\", prometheus=\"monitoring/prometheus-kube-prometheus-prometheus\", prometheus_replica=\"prometheus-prometheus-kube-prometheus-prometheus-0\", service=\"prometheus-kube-prometheus-kubelet\", state=\"uninterruptible\"}, timestamp=2021-04-26T05:24:53.599Z: out of bounds" msg="gRPC\n"
As a temporary fix, I released the PV, created a new one, and attached it to ingester-0. I found a few threads related to these errors in the Slack channel as well; I'll try the fixes from those too and post an update here.
Mainly, cortex-ingester-0 was restarted several times before it became stable.
Why was it restarted? Was it getting OOMKilled, was the process panicking, or something else?
out of bounds
This error means the timestamp of the remote-written sample is "too old". The rule of thumb is that with the blocks storage we can't ingest any sample more than 1h older than the most recent sample ingested for that tenant.
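To make the rule of thumb concrete, here is a minimal, hypothetical Go sketch of that bound check (the names acceptSample and headMaxTime are invented for illustration and are not the actual Cortex/Prometheus TSDB API; the 1h window is the default head block range assumed here):

```go
package main

import (
	"fmt"
	"time"
)

// acceptSample is a simplified, hypothetical illustration of the "out of
// bounds" rule described above: a sample whose timestamp is more than
// roughly 1h older than the most recent sample ingested for the tenant
// (the head max time) gets rejected.
func acceptSample(sampleTs, headMaxTime time.Time) error {
	const headBlockRange = 1 * time.Hour // assumed default TSDB head block range

	if sampleTs.Before(headMaxTime.Add(-headBlockRange)) {
		return fmt.Errorf("out of bounds: sample ts %s is too old compared to head max time %s",
			sampleTs.Format(time.RFC3339), headMaxTime.Format(time.RFC3339))
	}
	return nil
}

func main() {
	now := time.Now()
	// A sample 2h older than the newest ingested sample would be rejected.
	if err := acceptSample(now.Add(-2*time.Hour), now); err != nil {
		fmt.Println(err)
	}
}
```

In other words, once newer samples for the tenant have been ingested, samples retried from the period when an ingester was crash-looping can fall outside the ingestible window and be rejected with this error.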
Why was it restarted? Was it getting OOMKilled, was the process panicking, or something else?
Yes, I deployed Cortex freshly, but the ingesters were getting OOMKilled and restarted several times. Only after a couple of hours did they become stable and work as expected. Is there any way to fix this? I tried increasing the replica count and the CPU/memory limits, and only after those changes did the ingesters become stable, but still only after one or two hours.
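For anyone hitting the same OOMKills, the knobs I touched were the ingester replicas and the container resources. A placeholder sketch of what that looks like on the ingester StatefulSet is below (the names, labels, image tag, replica count, and numbers are assumptions for illustration, not the values from our deployment or from any official Cortex manifest):

```yaml
# Illustrative only: names, labels, image tag, and numbers are placeholders.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ingester
spec:
  serviceName: ingester
  replicas: 3              # more replicas spread the series (and memory) across ingesters
  selector:
    matchLabels:
      name: ingester
  template:
    metadata:
      labels:
        name: ingester
    spec:
      containers:
        - name: ingester
          image: quay.io/cortexproject/cortex:v1.8.0   # assumed tag
          args:
            - -target=ingester
          resources:
            requests:
              cpu: "2"
              memory: 8Gi
            limits:
              cpu: "4"
              memory: 12Gi
```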
This error means the timestamp of the remote-written sample is "too old". The rule of thumb is that with the blocks storage we can't ingest any sample more than 1h older than the most recent sample ingested for that tenant.
Got it, that could have happened: since the ingesters were not stable for a couple of hours, they were trying to process old data.