One Ingester PV is piling up the collected data and not sending to S3 #4115

Closed
prprasad2020 opened this issue Apr 26, 2021 · 3 comments
Labels: storage/blocks Blocks storage engine

Comments

@prprasad2020

After deploying Cortex, the ingesters took some time to become stable. Mainly, cortex-ingester-0 was restarted several times before it became stable. After a few hours all of them were working fine, but I noticed that ingester-0 was piling up the collected data for one tenant, because its disk utilization was going up considerably. This issue was not present in the other ingesters. Below is the disk utilization info of ingester-0:

[image: disk utilization graph of the ingester-0 PV]

I could also see that some of this tenant's metrics have no data when I query them from Grafana. When I did a manual flush using the ingester-0 UI, I saw the errors below:

level=warn ts=2021-04-26T05:26:27.339450005Z caller=ingester_v2.go:1219 msg="TSDB blocks compaction for user has failed" user=edpub-nonprod err="reload blocks: head truncate failed: truncate chunks.HeadReadWriter: maxt of the files are not set" compactReason=forced

level=warn ts=2021-04-26T05:25:30.396896504Z caller=grpc_logging.go:38 method=/cortex.Ingester/Push duration=579.012µs err="rpc error: code = Code(400) desc = user=edpub-nonprod: series={__name__=\"container_tasks_state\", container=\"mobile-analytics-producer-ck\", endpoint=\"https-metrics\", id=\"/kubepods/burstable/pod5e47b02d-6932-4242-8268-94d6ba77b292/b89f0368edf422c75c31322b3026478478f282ada1e06734193eceddcc498907\", image=\"715824223074.dkr.ecr.us-east-1.amazonaws.com/mobile-analytics-producer-ck@sha256:7341caf25a488172718bd7cd10c4f8d3a805006113942424141969f1844c8e7b\", instance=\"10.50.169.155:10250\", job=\"kubelet\", metrics_path=\"/metrics/cadvisor\", name=\"k8s_mobile-analytics-producer-ck_mobile-analytics-producer-ck-6b4b77d5c8-6fgrb_mobiledev_5e47b02d-6932-4242-8268-94d6ba77b292_0\", namespace=\"mobiledev\", node=\"ip-10-50-169-155.ec2.internal\", pod=\"mobile-analytics-producer-ck-6b4b77d5c8-6fgrb\", prometheus=\"monitoring/prometheus-kube-prometheus-prometheus\", prometheus_replica=\"prometheus-prometheus-kube-prometheus-prometheus-0\", service=\"prometheus-kube-prometheus-kubelet\", state=\"uninterruptible\"}, timestamp=2021-04-26T05:24:53.599Z: out of bounds" msg="gRPC\n"
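
For reference, the manual flush mentioned above can also be triggered against the ingester's HTTP API instead of the UI. A minimal sketch in Go, assuming the /flush endpoint (which is what the UI's Flush action calls; newer Cortex versions expose it at /ingester/flush) and a placeholder service address/port:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Placeholder address: adjust to the ingester's service name and HTTP port
	// used in your deployment.
	url := "http://cortex-ingester-0:8080/flush"

	// Trigger the same flush the "Flush" action in the ingester UI performs.
	resp, err := http.Post(url, "text/plain", nil)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```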

As a temporary fix, I released the PV, created a new one and attached it to ingester-0. I also found a few threads related to these errors in the Slack channel; I'll try the fixes from those as well and post an update here:

  1. https://cloud-native.slack.com/archives/CCYDASBLP/p1602310486362800
  2. https://cloud-native.slack.com/archives/CCYDASBLP/p1616263979064200
pracucci added the storage/blocks (Blocks storage engine) label on Apr 27, 2021
@pracucci
Contributor

Mainly, cortex-ingester-0 was restarted several times before it became stable.

Why was it restarted? Was it getting OOMKilled, was the process panicking, or something else?

out of bounds

This error means the timestamp of the remote-written sample is "too old". The rule of thumb is that with the blocks storage we can't ingest any sample that is more than 1h older than the most recent sample ingested for that tenant.
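
To make that rule concrete, here is a minimal sketch (not the actual Cortex/TSDB code) of the check described above; the 1h bound is taken from the rule of thumb and is an assumption, since the real limit depends on the TSDB head/block range configuration:

```go
package main

import (
	"fmt"
	"time"
)

// headBound is the ~1h rule of thumb from the comment above; the real bound
// depends on the TSDB head / block range configuration (assumption).
const headBound = time.Hour

// outOfBounds reports whether a pushed sample would be rejected because its
// timestamp is more than headBound older than the most recent sample already
// ingested for the tenant.
func outOfBounds(sampleTS, latestIngestedTS time.Time) bool {
	return sampleTS.Before(latestIngestedTS.Add(-headBound))
}

func main() {
	latest := time.Now()
	stale := latest.Add(-90 * time.Minute)  // e.g. data replayed after hours of instability
	fmt.Println(outOfBounds(stale, latest)) // true: rejected with "out of bounds"
}
```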

@prprasad2020
Author

prprasad2020 commented Apr 28, 2021

Why was it restarted? Was it getting OOMKilled, was the process panicking, or something else?

Yes, I deployed Cortex fresh, but the ingesters were getting OOMKilled and restarted several times. Only after a couple of hours did they become stable and work as expected. Is there any way to fix this? I tried increasing the replica count and the CPU/memory limits; only after those changes did the ingesters stabilize, and even then only after one or two hours.

This error means the timestamp of the remote-written sample is "too old". The rule of thumb is that with the blocks storage we can't ingest any sample that is more than 1h older than the most recent sample ingested for that tenant.

Got it, that could have happened: since the ingesters were not stable for a couple of hours, they were trying to process old data.

@prprasad2020
Author

  1. Found out that the "out of bounds" issue is already reported and is still in the Open state:

#2366

  2. Found that the error below is also fixed in Cortex 1.5.0, while I'm currently using 1.4.0:

reload blocks: head truncate failed: truncate chunks.HeadReadWriter: maxt of the files are not set

#3342

I'll try these out and post an update here if there are any issues.
