Skip to content

Improve handling of preview deployment failures on main #9837

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 6, 2022

Conversation

mads-hartmann
Copy link
Contributor

@mads-hartmann mads-hartmann commented May 6, 2022

Description

This PR improves how we handle preview deployment failures on main.

Rather than relying on catching errors everywhere and using exit 0 in the implementation of preview environment deployments, this moves the try/catch all the way to the top. This ensure that preview deployment errors never
breaks the main build.

Additionally the this improves a few other things as well

  • Using exit 0 means that all of our shutdown logic like flushing traces doesn't get executed which might result in missing spans. So getting rid of those is nice.
  • Always log the error that caused the job to fail. Previously, on the main build, the error would only be sent to Slack, and added to the span but wouldn't be visible in the actual Werft logs. This fixes that.
  • Using werft.fail will no longer end all spans. We have a finally handler that does that anyway, and ending all the spans in the fail code means we close the root span before the job is actually done (e.g. our top catch/finally logic hasn't been executed)

Related Issue(s)

Fixes #9659
Fixes https://github.com/gitpod-io/ops/issues/947

How to test

I have not tried to inject errors, so we won't know for sure this does the right thing until this is merged and we see a preview environment deployment failure on main.

Release Notes

NONE

Documentation

N/A

Previously the error would only be sent to Slack, and added to the span
but wouldn't be visible in the actual Werft logs. This fixes that
Rather than relying on catching errors everywhere and using exit 0 in the
implementation of preview environment deployments, this moves the try/catch
all the way to the top. This ensure that preview deployment errors never
breaks the main build.

Using exit 0 is also not ideal as it means we might not flush traces, so
getting rid of those is nice.
@mads-hartmann mads-hartmann force-pushed the mads-hartmann/werft-jobs-fail-often-9659 branch from c559563 to 9b83bff Compare May 6, 2022 19:24
@mads-hartmann mads-hartmann marked this pull request as ready for review May 6, 2022 19:33
@mads-hartmann mads-hartmann requested a review from a team May 6, 2022 19:42
Copy link
Member

@meysholdt meysholdt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice to see all the "if(!main" gone :)

@roboquat roboquat merged commit dbc5805 into main May 6, 2022
@roboquat roboquat deleted the mads-hartmann/werft-jobs-fail-often-9659 branch May 6, 2022 19:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[werft]: jobs fail often during "Flushing telemetry before stopping job"
3 participants