
Launching existing workspace fails with error message #4803


Closed
cjensenius opened this issue Jul 13, 2021 · 11 comments
Labels
type: incident Gitpod.io service is unstable

Comments

@cjensenius

Bug description

I tried to launch an existing workspace (https://gitpod.io/start/#indigo-chickadee-cben01ge) that I was using last night and received the error below.

Error: build failed: rpc error: code = Internal desc = cannot create build volume: cannot create build volume:

Steps to reproduce

Start https://gitpod.io/start/#indigo-chickadee-cben01ge and receive error

Expected behavior

Launching normally

Example repository

No response

Anything else?

No response

@carlosdp

No one on my team can start workspaces right now, this is on gitpod.io

@corneliusludmann corneliusludmann added the type: incident Gitpod.io service is unstable label Jul 13, 2021
@nickl98

nickl98 commented Jul 13, 2021

Same here, same problem as mentioned above.

@carlosdp

This is still showing green: https://www.gitpodstatus.com/ ....

@corneliusludmann
Contributor

Could you please try again and let me know if it works now?

@cjensenius
Author

Confirming my workspace is launching without the previous error; will update with final results.

@cjensenius
Author

Workspace launched without issue; this is resolved for me. Thanks a million!

@corneliusludmann
Contributor

Thanks for confirming. One of our image builders failed to build images. Everything should work again.

@ArthurSens
Contributor

Just for transparency, one of our nodes got a full disk and you folks were faster than our alerting system to catch this one 😅

@carlosdp

Do y'all post postmortems somewhere? Because this was ongoing for quite a bit of time without any communication from Gitpod acknowledging something was broken. That's a pretty slow alert lol. It also seems this disk-full error has been around for a long time; is there a permanent fix being worked on?

We lost an entire morning of work. If we are going to continue depending on Gitpod.io as our primary dev environment, we need to at the very least know you're awake and working on fixing it when it's broken! The first thing on every incident runbook for any team I've ever been a part of is to update the status page.

@JohannesLandgraf
Contributor

JohannesLandgraf commented Jul 13, 2021

Hi @carlosdp, we are very sorry for the trouble we caused this morning to you and your team. I wanted to take the time and shine some light on our current processes regarding incident response.

Timeline

  • The issue was reported at 4:34 pm CEST
  • Our internal alert fired at 4:56 pm CEST
  • The issue was resolved at 5:17 pm CEST

Currently, all our post-mortems are private in our internal Notion. On https://www.gitpodstatus.com/ we post everything classified as a critical and/or major incident. The incident severity classification is based on user impact and which services of Gitpod are affected. Usually our full-disk alerts are pretty harmless (i.e. no direct user impact), which is why, according to our runbook, we don't treat every full disk as an incident (and don't update the status page for it).

As mentioned in this blog post, we have metrics that inform us about the usage of compute resources (e.g. disk usage), but we are still improving the metrics that directly show us user impact (e.g. workspace start failure rate).
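
(Purely for illustration, not our actual code: a user-impact metric of that kind could be exported with the Go Prometheus client roughly like the sketch below; the metric and function names here are made up for the example.)

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Counters for workspace start attempts and failures, labelled by region.
var (
	workspaceStarts = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "workspace_starts_total",
		Help: "Total number of workspace start attempts.",
	}, []string{"region"})

	workspaceStartFailures = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "workspace_start_failures_total",
		Help: "Total number of failed workspace starts.",
	}, []string{"region"})
)

// recordStart would be called from the workspace start path (hypothetical).
func recordStart(region string, err error) {
	workspaceStarts.WithLabelValues(region).Inc()
	if err != nil {
		workspaceStartFailures.WithLabelValues(region).Inc()
	}
}

func main() {
	// Expose the counters; an alert on the ratio
	// workspace_start_failures_total / workspace_starts_total
	// fires on user impact regardless of which node or disk caused it.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```

An alert on that failure ratio would have fired in this incident even though only one node's disk was full.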

To give you some more technical context on why you hit quite an edge case for our system: looking at our Grafana dashboards, right now we're running 89 nodes in production, spread over 7 clusters across 2 regions (US and EU).
[Grafana dashboard screenshot]
What happened in your particular case is that one disk was full, though with 88 other nodes that doesn't necessarily mean clear user impact (users land on other nodes). We have a single component called image-builder that is responsible for building new workspace images (actually one per region), and the single node that this component was running on got a full disk.

I.e. already running workspaces were not affected at all, nor were workspaces using images that had already been built beforehand. In your case you needed a new workspace image in the US region, which image-builder was not able to serve. Having better metrics measuring user impact would have helped in that situation.

We definitely want and will get better at this. As mentioned in the blog post, we're iterating on how we do Site Reliability Engineering within the company and being able to measure user impact clearly is on the top-priority list.

Regarding a potential fix, we will prioritise work on #4804, which automates the clean-up for image-builder and would have fixed the problem in your particular case.
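
(For illustration only, not the actual implementation: the kind of clean-up automation #4804 describes could look roughly like the Go sketch below; the volume path, the threshold, and the use of `docker image prune` are assumptions for the example.)

```go
package main

// Hypothetical sketch: periodically check disk usage on the image-builder
// node and prune unused images once a threshold is crossed.

import (
	"log"
	"os/exec"
	"syscall"
	"time"
)

const (
	buildVolumePath = "/var/lib/docker" // assumed location of the build volume
	usageThreshold  = 0.85              // prune once the disk is 85% full
	checkInterval   = 5 * time.Minute
)

// diskUsage returns the used fraction of the filesystem containing path (Linux).
func diskUsage(path string) (float64, error) {
	var stat syscall.Statfs_t
	if err := syscall.Statfs(path, &stat); err != nil {
		return 0, err
	}
	total := float64(stat.Blocks) * float64(stat.Bsize)
	avail := float64(stat.Bavail) * float64(stat.Bsize)
	return (total - avail) / total, nil
}

func main() {
	for {
		usage, err := diskUsage(buildVolumePath)
		if err != nil {
			log.Printf("cannot stat %s: %v", buildVolumePath, err)
		} else if usage > usageThreshold {
			log.Printf("disk %.0f%% full, pruning unused images", usage*100)
			// `docker image prune -af` removes all unused images.
			if out, err := exec.Command("docker", "image", "prune", "-af").CombinedOutput(); err != nil {
				log.Printf("prune failed: %v: %s", err, out)
			}
		}
		time.Sleep(checkInterval)
	}
}
```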

@carlosdp

Thanks for the response 👍 I understand what happened now, and I see you are working toward fixing the underlying issue.

The only thing I'd stress is just how important that customer communication is during the incident. I've been there before, classifying whether or not an incident was "user-facing" and whether putting out comms about it is necessary. In this case, multiple customers were complaining, so there was clearly customer impact at that point.

The moment someone realizes there actually is something wrong, you gotta put a red or yellow thing on that status board. Even if it has no info yet and says "some customers are experiencing issues, we're looking into it." That simple act of being pro-active will save a lot of pain and win you a lot of loyalty, from my experience. I would even go back and retroactively add this incident with a link to this postmortem, because I'm sure there's some subset of customers that didn't see this thread or the forum thread and might think you guys didn't notice there was something wrong earlier.

Always better to be extremely quick with letting people know you know there's a problem, even more important than how quickly you remediate imo. 😄
