Crons: Linking monitors to errors #43647

@evanpurkhiser

Problem Statement

When using the Sentry Crons feature, errors produced during execution of a monitored task should be associated with the monitor ID. We should also associate errors and transactions with the specific checkin that is reported.

What exists now?

Currently monitors support linking to errors that occur during execution via the monitor.id tag.

This tag is promoted from the MonitorContext.

Right now we are manually setting up this context in a few different places.

  • When a checkin fails it sets the context on the created “checkin failed” error event

    🤔 Note that it also manually sets the monitor.id in the tags. I believe this may be a mistake since the context promotes the id to a tag

  • The sentry monitor utils that are used in our celery tasks set the monitor context to associate errors to the monitor
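
To make the promotion step concrete, here is a minimal sketch of what "promoting" the monitor ID from the context to a tag could look like. It assumes a simplified, dict-shaped event; `promote_monitor_id` is a hypothetical helper for illustration and is not Sentry's actual ingestion code.

```python
from copy import deepcopy


def promote_monitor_id(event: dict) -> dict:
    """Return a copy of `event` with contexts.monitor.id copied to the monitor.id tag.

    Hypothetical helper: the event shape here is a simplified assumption.
    """
    promoted = deepcopy(event)
    monitor_id = promoted.get("contexts", {}).get("monitor", {}).get("id")
    if monitor_id is not None:
        # The tag is derived from the context, so callers should not
        # need to set it by hand as well.
        promoted.setdefault("tags", {})["monitor.id"] = str(monitor_id)
    return promoted


event = {"message": "checkin failed", "contexts": {"monitor": {"id": "abc123"}}}
tagged = promote_monitor_id(event)
```

If promotion works this way, setting the tag manually on the "checkin failed" event would be redundant, which is the suspicion raised in the note above.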

So what’s the problem?

This works, but it has a few issues:

  1. Checkin failure events do not track which particular checkin triggered the failure
  2. Users have to manually set the monitors context to associate an error to a particular monitor
  3. Even when the monitor.id context is manually set, there is no association of error to checkin.

Proposed checkin association strategy

We should use our Trace ID to associate checkins with errors and transactions. Realistically, this means doing the following:

  1. A new trace_id UUID column should be added to the MonitorCheckin table. This should be indexed, as we will use it to look up associated checkins.

  2. The monitor context should continue to exist on errors so that we can easily query for errors that we know are part of a monitor

  3. The trace_id should be provided as part of the checkin API request. It will not be required

    🤔 Open Question: Should we generate a trace ID if they do not pass it, and if so, should we return it as part of the API response?
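
The storage side of this strategy can be sketched with an in-memory SQLite table. This is only an illustration under stated assumptions: the table and column names other than trace_id are made up, the trace ID is stored as a 32-character hex string, and `record_checkin` and `checkins_for_trace` are hypothetical helpers rather than Sentry's models or API.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE monitor_checkin (
        id INTEGER PRIMARY KEY,
        monitor_id TEXT NOT NULL,
        status TEXT NOT NULL,
        trace_id TEXT  -- nullable: the checkin API would not require it
    )
    """
)
# Indexed because checkins are looked up by the trace ID found on an event.
conn.execute("CREATE INDEX monitor_checkin_trace_id ON monitor_checkin (trace_id)")


def record_checkin(monitor_id, status, trace_id=None):
    """Store a checkin; trace_id is optional, mirroring the proposed API."""
    cur = conn.execute(
        "INSERT INTO monitor_checkin (monitor_id, status, trace_id) VALUES (?, ?, ?)",
        (monitor_id, status, trace_id),
    )
    return cur.lastrowid


def checkins_for_trace(trace_id):
    """Return the ids of checkins that share a trace ID with an error or transaction."""
    rows = conn.execute(
        "SELECT id FROM monitor_checkin WHERE trace_id = ? ORDER BY id", (trace_id,)
    )
    return [row[0] for row in rows]


trace_id = uuid.uuid4().hex
linked = record_checkin("my-monitor", "in_progress", trace_id)
unlinked = record_checkin("my-monitor", "ok")  # no trace ID supplied
```

Here `checkins_for_trace(trace_id)` finds only the first checkin. The second one has no trace ID and so cannot be linked to any event, which is the gap the open question above is about.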

Proposed integration with sentry-cli and SDKs

Both the checkin and the SDKs producing errors and traces must be aware of the monitor context and the Trace ID being used for the monitor run. Here’s what needs to happen:

  1. The sentry-cli should generate a Trace ID and send that with the checkins.
  2. The sentry-cli monitor run <monitor_id> -- <command> should set two environment variables during execution of <command>
    1. SENTRY_TRACE_ID: The Trace ID generated by the CLI
      feat(monitors): Pass SENTRY_TRACE_ID down execution path sentry-cli#1441
    2. SENTRY_MONITOR_ID: The Monitor UUID
      feat(monitors): Pass SENTRY_MONITOR_ID to executing process sentry-cli#1438
  3. Sentry SDKs should be updated to understand both of these environment variables
    1. SENTRY_TRACE_ID should be used in place of the SDK generating its own trace ID
    2. SENTRY_MONITOR_ID should be used to set up the monitor context of any events that occur during execution
      feat(crons): Add CronMonitorIntegration sentry-python#1866
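
The flow above can be sketched end to end in Python. This is an illustration of the proposal, not the sentry-cli or SDK implementation: only the two environment variable names come from this issue, while `build_child_env`, `monitor_scope_from_env`, `run_monitored`, the returned dict shape, and the 32-character hex trace ID format are assumptions.

```python
import os
import subprocess
import uuid


def build_child_env(monitor_id, base_env=None):
    """CLI side: copy the environment and add the two proposed variables."""
    env = dict(os.environ if base_env is None else base_env)
    env["SENTRY_TRACE_ID"] = uuid.uuid4().hex  # assumed format: 32 hex characters
    env["SENTRY_MONITOR_ID"] = monitor_id
    return env


def monitor_scope_from_env(env):
    """SDK side: derive the trace ID and monitor context from the environment.

    Returns an empty dict when the process was not started by a monitor run.
    """
    scope = {}
    if env.get("SENTRY_TRACE_ID"):
        scope["trace_id"] = env["SENTRY_TRACE_ID"]
    if env.get("SENTRY_MONITOR_ID"):
        scope["contexts"] = {"monitor": {"id": env["SENTRY_MONITOR_ID"]}}
    return scope


def run_monitored(monitor_id, command):
    """Run `command` (a list of arguments) with the monitor variables set."""
    env = build_child_env(monitor_id)
    return subprocess.run(command, env=env, capture_output=True, text=True)


child_env = build_child_env("my-monitor-uuid", base_env={})
scope = monitor_scope_from_env(child_env)
```

In this sketch the wrapper would also send the same trace ID with its checkins, so that events from the child process and the checkin share one identifier.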

Updates

  1. At the moment, propagating a SENTRY_TRACE_ID is more involved than the scope of this ticket allows; see Crons: Linking monitors to errors #43647 (comment)
