Description
We are not using AWS S3 but a "compatible reimplementation of S3" from Quobyte. As soon as larger dashboards issue more than a few parallel queries, we get errors like this in the frontend:
"cannot iterate chunk for series: {name="node_network_transmit_drop_total", customer="syseleven", datacenter="bki1", device="gretap0", environment="development", hardwarenode="vz-srv12345", instance="syseleven.testvlan6.sys11service01", job="node", platform="pvc7", project="testvlan6"}: EOF"
The queriers log the following for failed queries:
level=error ts=2020-04-02T17:12:28.435445802Z caller=worker.go:167 msg="error processing requests" err="rpc error: code = Canceled desc = context canceled"
I captured traffic with tcpdump and saw some GET requests with Range headers failing with HTTP status code 416:
2502 34.422156 5.6.7.8 1.2.3.4 HTTP/XML 541 HTTP/1.1 416 Range Not Satisfiable
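For context on the 416 above: as far as I understand RFC 7233, a server should answer a single `bytes=first-last` range with 416 only when `first` lies at or beyond the end of the object, and should ignore a malformed range and return the full object. The following is a minimal sketch of that rule, not code from Thanos or Quobyte; `rangeStatus` is a hypothetical helper and the object size of 1000 bytes is made up for illustration.

```go
package main

import "fmt"

// rangeStatus returns the status code a server following RFC 7233 would be
// expected to send for a single "bytes=first-last" request against an object
// of the given size. Hypothetical helper for illustration only.
func rangeStatus(first, last, size int64) int {
	if first < 0 || last < first {
		// Syntactically invalid range: the Range header is ignored and the
		// full object is returned.
		return 200
	}
	if first >= size {
		// The range starts at or past the end of the object.
		return 416
	}
	// Satisfiable; a last position beyond the end is clamped by the server.
	return 206
}

func main() {
	const size = 1000 // assumed object size in bytes

	fmt.Println(rangeStatus(0, 499, size))     // range inside the object
	fmt.Println(rangeStatus(900, 1999, size))  // end past the object, still satisfiable
	fmt.Println(rangeStatus(1000, 1099, size)) // start at the object size
}
```

If the failing requests in the capture have a start offset below the object size, a 416 would not match this rule, which would point at the storage backend rather than at the offsets being requested.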
At first we had issues because our S3-compatible implementation was rate-limiting, so there may simply still be too many requests. I also opened thanos-io/thanos#2343 (just merged) in the hope that it would address the problem of too many connections. I wonder whether AWS does rate-limiting as well, and whether rate limiting is still the main issue here.
I also had a long conversation about this topic in your Slack channel: https://cloud-native.slack.com/archives/CCYDASBLP/p1585211354038800
I can provide more information if needed.