
[Bug]: trace BatchSpanProcessor::force_flush can deadlock #1395

@ipetkov

Description

What happened?

I have an application which uses a multi-threaded tokio runtime. Calling trace_provider.force_flush() in our pre-shutdown routine consistently deadlocks the application (since the call never returns).
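
For context, a minimal sketch of the kind of setup involved (the builder calls below are assumptions based on the versions listed further down, not our real code, and may need adjusting):

use opentelemetry_sdk::{runtime, trace::TracerProvider};

#[tokio::main] // multi-threaded runtime by default
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumed opentelemetry-otlp 0.14 builder API for a tonic/gRPC exporter.
    let exporter = opentelemetry_otlp::new_exporter()
        .tonic()
        .build_span_exporter()?;

    // Batch span processor driven by the ambient (multi-threaded) tokio runtime.
    let provider = TracerProvider::builder()
        .with_batch_exporter(exporter, runtime::Tokio)
        .build();

    // ... application work that records spans ...

    // Pre-shutdown flush: this is the call that never returns for us.
    for result in provider.force_flush() {
        result?;
    }
    Ok(())
}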

I suspect the internal use of futures_executor::block_on is the culprit. My hypothesis is that calling force_flush from an async task blocks the runtime worker thread and prevents the export machinery from making progress if the scheduler cannot move the other tasks assigned to that thread. This hypothesis is further supported by the observation that starting the opentelemetry_otlp pipeline from a dedicated, single-threaded tokio runtime no longer exhibits the deadlock when trace_provider.force_flush() is called (notably because a separate thread remains available to drive the internal export tasks while the caller is blocked).

// Inside opentelemetry_sdk's BatchSpanProcessor::force_flush: the flush result
// is awaited with a blocking call on the calling thread.
futures_executor::block_on(res_receiver)
    .map_err(|err| TraceError::Other(err.into()))
    .and_then(|identity| identity)
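
For completeness, this is roughly what the non-deadlocking setup looks like: the pipeline is built on a dedicated, single-threaded tokio runtime running on its own OS thread, so the export tasks can still make progress while force_flush() blocks a caller on the main runtime (names and structure here are illustrative, and the exporter builder call is again an assumption for opentelemetry-otlp 0.14):

use std::{sync::mpsc, thread};

use opentelemetry_sdk::{runtime, trace::TracerProvider};
use tokio::runtime::Builder;

// Build the batch pipeline on a dedicated current-thread runtime and hand the
// provider back to the caller; the runtime is kept alive so the batch/export
// tasks keep running on this thread.
fn spawn_telemetry_runtime() -> TracerProvider {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let rt = Builder::new_current_thread()
            .enable_all()
            .build()
            .expect("telemetry runtime");
        rt.block_on(async move {
            // Assumed opentelemetry-otlp 0.14 builder API.
            let exporter = opentelemetry_otlp::new_exporter()
                .tonic()
                .build_span_exporter()
                .expect("otlp exporter");
            let provider = TracerProvider::builder()
                .with_batch_exporter(exporter, runtime::Tokio)
                .build();
            tx.send(provider).expect("send provider");
            // Park forever so this runtime keeps driving the spawned tasks.
            std::future::pending::<()>().await
        });
    });
    rx.recv().expect("telemetry runtime failed to start")
}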

API Version

Not sure; we're using the opentelemetry-collector with Jaeger, so likely the latest API version.

SDK Version

opentelemetry 0.21.0
opentelemetry-otlp 0.14.0
opentelemetry_sdk 0.21.1

What Exporters are you seeing the problem on?

OTLP

Relevant log output

No useful log output. I did some println debugging of the opentelemetry-sdk internals (with a local fork), which showed that once force_flush was called, the internal methods of the BatchSpanProcessor stopped processing messages.
