Querier S3 fetch failure results in 400 error and no retries #1246

@rlisagor

Description

When using S3 as block storage, large queries often fail because reading even a single key from S3 fails. Here's an example of the logs for such a case:

ts=2019-02-25T22:08:48.037322434Z caller=spanlogger.go:36 org_id=... trace_id=30d4f854cbad9666 method=SeriesStore.Get level=debug from=1550583324 through=1550590447 matchers=2
ts=2019-02-25T22:08:48.037417273Z caller=spanlogger.go:36 org_id=... trace_id=30d4f854cbad9666 method=SeriesStore.Get level=debug metric=...
ts=2019-02-25T22:08:48.037471934Z caller=spanlogger.go:36 org_id=... trace_id=30d4f854cbad9666 method=SeriesStore.lookupSeriesByMetricNameMatchers level=debug metricName=... matchers=1
ts=2019-02-25T22:08:48.037521511Z caller=spanlogger.go:36 org_id=... trace_id=30d4f854cbad9666 method=SeriesStore.lookupSeriesByMetricNameMatcher level=debug metricName=... matcher="..."
ts=2019-02-25T22:08:48.037749053Z caller=spanlogger.go:36 org_id=... trace_id=30d4f854cbad9666 method=SeriesStore.lookupSeriesByMetricNameMatcher level=debug queries=1
ts=2019-02-25T22:08:48.0496863Z caller=spanlogger.go:36 org_id=... trace_id=30d4f854cbad9666 method=SeriesStore.lookupSeriesByMetricNameMatcher level=debug entries=14186
ts=2019-02-25T22:08:48.064778726Z caller=spanlogger.go:36 org_id=... trace_id=30d4f854cbad9666 method=SeriesStore.lookupSeriesByMetricNameMatcher level=debug ids=14186
ts=2019-02-25T22:08:48.064850921Z caller=spanlogger.go:36 org_id=... trace_id=30d4f854cbad9666 method=SeriesStore.lookupSeriesByMetricNameMatchers level=debug msg="post intersection" ids=14186
ts=2019-02-25T22:08:48.064881316Z caller=spanlogger.go:36 org_id=... trace_id=30d4f854cbad9666 method=SeriesStore.Get level=debug series-ids=14186
ts=2019-02-25T22:08:48.064917821Z caller=spanlogger.go:36 org_id=... trace_id=30d4f854cbad9666 method=SeriesStore.lookupChunksBySeries level=debug seriesIDs=14186
ts=2019-02-25T22:08:48.088699806Z caller=spanlogger.go:36 org_id=... trace_id=30d4f854cbad9666 method=SeriesStore.lookupChunksBySeries level=debug queries=14186
ts=2019-02-25T22:08:48.242939072Z caller=spanlogger.go:36 org_id=... trace_id=30d4f854cbad9666 method=SeriesStore.lookupChunksBySeries level=debug entries=14002
ts=2019-02-25T22:08:48.252688896Z caller=spanlogger.go:36 org_id=... trace_id=30d4f854cbad9666 method=SeriesStore.Get level=debug chunk-ids=14002
ts=2019-02-25T22:08:48.266296887Z caller=spanlogger.go:36 org_id=... trace_id=30d4f854cbad9666 method=SeriesStore.Get level=debug chunks-post-filtering=14002
ts=2019-02-25T22:08:55.707756185Z caller=spanlogger.go:36 org_id=... trace_id=30d4f854cbad9666 method=SeriesStore.Get level=error msg=FetchChunks err="RequestError: send request failed\ncaused by: Get https://...: net/http: TLS handshake timeout"

This returns 400 (Bad Request) to the caller. The Query Frontend only retries non-HTTP errors or 5xx errors, so this 400 goes straight back to the end user.

Frontend aside, it might be worth adding some retry behaviour in the Querier directly. Large reads from S3 like the one above fail quite often.
