
Launching existing workspace fails with error message #4803


Closed
cjensenius opened this issue Jul 13, 2021 · 11 comments
Labels
type: incident Gitpod.io service is unstable

Comments

@cjensenius

Bug description

I tried to launch an existing workspace (https://gitpod.io/start/#indigo-chickadee-cben01ge) that I was using last night and received the error below.

Error: build failed: rpc error: code = Internal desc = cannot create build volume: cannot create build volume:

Steps to reproduce

Start https://gitpod.io/start/#indigo-chickadee-cben01ge and receive error

Expected behavior

Launching normally

Example repository

No response

Anything else?

No response

@carlosdp

No one on my team can start workspaces right now, this is on gitpod.io

@corneliusludmann corneliusludmann added the type: incident Gitpod.io service is unstable label Jul 13, 2021
@nickl98

nickl98 commented Jul 13, 2021

Same here, same problem as mentioned above.

@carlosdp

This is still showing green: https://www.gitpodstatus.com/ ....

@corneliusludmann
Contributor

Could you please try again and let me know if it works now?

@cjensenius
Author

Confirming my workspace is launching without the previous error; will update with final results.

@cjensenius
Author

Workspace launched without issue; this is resolved for me. Thanks a million!

@corneliusludmann
Contributor

Thanks for confirming. One of our image builders failed to build images. Everything should work again.

@ArthurSens
Contributor

Just for transparency, one of our nodes got a full disk and you folks were faster than our alerting system to catch this one 😅

@carlosdp

Do y'all post postmortems somewhere? Because this was ongoing for quite a bit of time without any communication from Gitpod acknowledging something was broken. That's a pretty slow alert lol. It also seems this disk-full error has been around for a long time; is there a permanent fix being worked on?

We lost an entire morning of work. If we are going to continue depending on Gitpod.io as our primary dev environment, we need to at the very least know you're awake and working on fixing it when it's broken! The first thing on every incident runbook for any team I've ever been a part of is to update the status page.

@JohannesLandgraf
Contributor

JohannesLandgraf commented Jul 13, 2021

Hi @carlosdp, we are very sorry for the trouble we caused this morning to you and your team. I wanted to take the time and shine some light on our current processes regarding incident response.

Timeline

  • The issue was reported at 4:34 pm CEST
  • Our internal alert fired at 4:56 pm CEST
  • The issue was resolved at 5:17 pm CEST

Currently, all our post-mortems are private in our internal Notion. On https://www.gitpodstatus.com/ we post everything classified as a critical and/or major incident. The incident severity classification is based on user impact and which services of Gitpod are affected. Usually our full-disk alerts are pretty harmless (i.e. no direct user impact), which is why, according to our runbook, we don't treat every full disk as an incident (and don't update the status page for it).

As mentioned in this blog post, we have metrics that inform us about the usage of compute resources (e.g. disk usage), but we are still improving the metrics that directly show us user impact (e.g. workspace start failure rate).
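
(Purely for illustration, not our actual code: a user-impact metric of that kind could be exported with the Go Prometheus client roughly like the sketch below; the metric and function names here are made up for the example.)

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Counters for workspace start attempts and failures, labelled by region.
var (
	workspaceStarts = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "workspace_starts_total",
		Help: "Total number of workspace start attempts.",
	}, []string{"region"})

	workspaceStartFailures = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "workspace_start_failures_total",
		Help: "Total number of failed workspace starts.",
	}, []string{"region"})
)

// recordStart would be called from the workspace start path (hypothetical).
func recordStart(region string, err error) {
	workspaceStarts.WithLabelValues(region).Inc()
	if err != nil {
		workspaceStartFailures.WithLabelValues(region).Inc()
	}
}

func main() {
	// Expose the counters; an alert on the ratio
	// workspace_start_failures_total / workspace_starts_total
	// fires on user impact regardless of which node or disk caused it.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```

An alert on that failure ratio would have fired in this incident even though only one node's disk was full.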

To give you some more technical context on why you hit quite an edge case for our system: looking at our Grafana dashboards, right now we're running 89 nodes in production, spread over 7 clusters across 2 regions (US and EU).
[Grafana dashboard screenshot]
What happened in your particular case is that one disk was full, though with 88 other nodes that doesn't necessarily mean clear user impact (users land on other nodes). We have a single component called image-builder that is responsible for building new workspace images (actually one per region), and the single node that this component was running on got a full disk.

I.e. already running workspaces were not affected at all, nor were workspaces using images that had already been built beforehand. In your case you needed a new workspace image in the US region, which image-builder was not able to serve. Having better metrics measuring user impact would have helped in that situation.

We definitely want and will get better at this. As mentioned in the blog post, we're iterating on how we do Site Reliability Engineering within the company and being able to measure user impact clearly is on the top-priority list.

Regarding a potential fix, we will prioritise work on #4804, which automates the clean-up for image-builder and would have fixed the problem in your particular case.
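
(For illustration only, not the actual implementation: the kind of clean-up automation #4804 describes could look roughly like the Go sketch below; the volume path, the threshold, and the use of `docker image prune` are assumptions for the example.)

```go
package main

// Hypothetical sketch: periodically check disk usage on the image-builder
// node and prune unused images once a threshold is crossed.

import (
	"log"
	"os/exec"
	"syscall"
	"time"
)

const (
	buildVolumePath = "/var/lib/docker" // assumed location of the build volume
	usageThreshold  = 0.85              // prune once the disk is 85% full
	checkInterval   = 5 * time.Minute
)

// diskUsage returns the used fraction of the filesystem containing path (Linux).
func diskUsage(path string) (float64, error) {
	var stat syscall.Statfs_t
	if err := syscall.Statfs(path, &stat); err != nil {
		return 0, err
	}
	total := float64(stat.Blocks) * float64(stat.Bsize)
	avail := float64(stat.Bavail) * float64(stat.Bsize)
	return (total - avail) / total, nil
}

func main() {
	for {
		usage, err := diskUsage(buildVolumePath)
		if err != nil {
			log.Printf("cannot stat %s: %v", buildVolumePath, err)
		} else if usage > usageThreshold {
			log.Printf("disk %.0f%% full, pruning unused images", usage*100)
			// `docker image prune -af` removes all unused images.
			if out, err := exec.Command("docker", "image", "prune", "-af").CombinedOutput(); err != nil {
				log.Printf("prune failed: %v: %s", err, out)
			}
		}
		time.Sleep(checkInterval)
	}
}
```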

@carlosdp

Thanks for the response 👍 I understand what happened now, and I see you are working toward fixing the underlying issue.

The only thing I'd stress is just how important that customer communication is during the incident. I've been there before, classifying whether or not an incident was "user-facing" and whether putting out comms about it is necessary. In this case, multiple customers were complaining, so there was clearly customer impact at that point.

The moment someone realizes there actually is something wrong, you gotta put a red or yellow thing on that status board. Even if it has no info yet and says "some customers are experiencing issues, we're looking into it." That simple act of being pro-active will save a lot of pain and win you a lot of loyalty, from my experience. I would even go back and retroactively add this incident with a link to this postmortem, because I'm sure there's some subset of customers that didn't see this thread or the forum thread and might think you guys didn't notice there was something wrong earlier.

Always better to be extremely quick with letting people know you know there's a problem, even more important than how quickly you remediate imo. 😄
