One Ingester PV is piling up the collected data and not sending to S3 #4115

Closed
prprasad2020 opened this issue Apr 26, 2021 · 3 comments
Labels: storage/blocks Blocks storage engine

Comments

@prprasad2020

After deploying Cortex, the ingesters took some time to become stable. Mainly, cortex-ingester-0 was restarted several times before it became stable. After a few hours all of them were working fine, but I noticed that ingester-0 was piling up the collected data for one tenant, because its disk utilization was going up considerably. This issue was not present in the other ingesters. Below is the disk utilization info of ingester-0:

[image: disk utilization graph of the ingester-0 PV]

I could also see that some of this tenant's metrics have no data when I query them from Grafana. When I did a manual flush using the ingester-0 UI, I saw the errors below:

level=warn ts=2021-04-26T05:26:27.339450005Z caller=ingester_v2.go:1219 msg="TSDB blocks compaction for user has failed" user=edpub-nonprod err="reload blocks: head truncate failed: truncate chunks.HeadReadWriter: maxt of the files are not set" compactReason=forced

level=warn ts=2021-04-26T05:25:30.396896504Z caller=grpc_logging.go:38 method=/cortex.Ingester/Push duration=579.012µs err="rpc error: code = Code(400) desc = user=edpub-nonprod: series={__name__=\"container_tasks_state\", container=\"mobile-analytics-producer-ck\", endpoint=\"https-metrics\", id=\"/kubepods/burstable/pod5e47b02d-6932-4242-8268-94d6ba77b292/b89f0368edf422c75c31322b3026478478f282ada1e06734193eceddcc498907\", image=\"715824223074.dkr.ecr.us-east-1.amazonaws.com/mobile-analytics-producer-ck@sha256:7341caf25a488172718bd7cd10c4f8d3a805006113942424141969f1844c8e7b\", instance=\"10.50.169.155:10250\", job=\"kubelet\", metrics_path=\"/metrics/cadvisor\", name=\"k8s_mobile-analytics-producer-ck_mobile-analytics-producer-ck-6b4b77d5c8-6fgrb_mobiledev_5e47b02d-6932-4242-8268-94d6ba77b292_0\", namespace=\"mobiledev\", node=\"ip-10-50-169-155.ec2.internal\", pod=\"mobile-analytics-producer-ck-6b4b77d5c8-6fgrb\", prometheus=\"monitoring/prometheus-kube-prometheus-prometheus\", prometheus_replica=\"prometheus-prometheus-kube-prometheus-prometheus-0\", service=\"prometheus-kube-prometheus-kubelet\", state=\"uninterruptible\"}, timestamp=2021-04-26T05:24:53.599Z: out of bounds" msg="gRPC\n"
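
For reference, the manual flush mentioned above can also be triggered against the ingester's HTTP API instead of the UI. A minimal sketch in Go, assuming the /flush endpoint (which is what the UI's Flush action calls; newer Cortex versions expose it at /ingester/flush) and a placeholder service address/port:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Placeholder address: adjust to the ingester's service name and HTTP port
	// used in your deployment.
	url := "http://cortex-ingester-0:8080/flush"

	// Trigger the same flush the "Flush" action in the ingester UI performs.
	resp, err := http.Post(url, "text/plain", nil)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```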

As a temporary fix, I released the PV, created a new one and attached it to ingester-0. I also found a few threads related to these errors in the Slack channel; I'll try the fixes from those as well and post an update here:

  1. https://cloud-native.slack.com/archives/CCYDASBLP/p1602310486362800
  2. https://cloud-native.slack.com/archives/CCYDASBLP/p1616263979064200
pracucci added the storage/blocks (Blocks storage engine) label on Apr 27, 2021
@pracucci
Contributor

Mainly, cortex-ingester-0 was restarted several times before it became stable.

Why was it restarted? Was it getting OOMKilled, was the process panicking, or something else?

out of bounds

This error means the timestamp of the remote-written sample is "too old". The rule of thumb is that with the blocks storage we can't ingest any sample that is more than 1h older than the most recent sample ingested for that tenant.
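
To make that rule concrete, here is a minimal sketch (not the actual Cortex/TSDB code) of the check described above; the 1h bound is taken from the rule of thumb and is an assumption, since the real limit depends on the TSDB head/block range configuration:

```go
package main

import (
	"fmt"
	"time"
)

// headBound is the ~1h rule of thumb from the comment above; the real bound
// depends on the TSDB head / block range configuration (assumption).
const headBound = time.Hour

// outOfBounds reports whether a pushed sample would be rejected because its
// timestamp is more than headBound older than the most recent sample already
// ingested for the tenant.
func outOfBounds(sampleTS, latestIngestedTS time.Time) bool {
	return sampleTS.Before(latestIngestedTS.Add(-headBound))
}

func main() {
	latest := time.Now()
	stale := latest.Add(-90 * time.Minute)  // e.g. data replayed after hours of instability
	fmt.Println(outOfBounds(stale, latest)) // true: rejected with "out of bounds"
}
```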

@prprasad2020
Author

prprasad2020 commented Apr 28, 2021

Why was it restarted? Was it getting OOMKilled, was the process panicking, or something else?

Yes, I deployed Cortex fresh, but the ingesters were getting OOMKilled and restarted several times. Only after a couple of hours did they become stable and work as expected. Is there any way to fix this? I tried increasing the replica count and the CPU/memory limits; only after those changes did the ingesters stabilize, and even then only after one or two hours.

This error means the timestamp of the remote-written sample is "too old". The rule of thumb is that with the blocks storage we can't ingest any sample that is more than 1h older than the most recent sample ingested for that tenant.

Got it, that could have happened: since the ingesters were not stable for a couple of hours, they were trying to process old data.

@prprasad2020
Author

  1. Found out that the "out of bounds" issue is already reported and is still in the Open state:

#2366

  2. Found that the error below is also fixed in Cortex 1.5.0, while I'm currently using 1.4.0:

reload blocks: head truncate failed: truncate chunks.HeadReadWriter: maxt of the files are not set

#3342

I'll try these out and post an update here if there are any issues.
