Skip to content

Stop running prebuilds for inactive projects (10+ weeks) #9219

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 12, 2022

Conversation

jankeromnes
Copy link
Contributor

@jankeromnes jankeromnes commented Apr 11, 2022

Description

Stop running prebuilds for projects that did not start a workspace in the last 10+ weeks.

Related Issue(s)

Fixes #8911
Drive-by fixes prebuild rate limits again

How to test

  1. Add a Project
  2. Verify that Prebuilds are triggered successfully and workspaces can be started
  3. Verify that lastWorkspaceStart and lastWebhookReceived are properly set in d_b_project_usage
  4. Manually set lastWorkspaceStart to a date more than 10 weeks ago in d_b_project_usage
  5. Trigger new Prebuilds -- verify that they now get auto-cancelled on start due to Project inactivity

Release Notes

Stop running prebuilds for projects that did not start a workspace in the last 10+ weeks

Documentation

@jankeromnes
Copy link
Contributor Author

On a side note, we now have both d_b_project_info (contains cached Branches information) and d_b_project_usage (contains two timestamps).

I considered adding the timestamps to d_b_project_info instead of creating a new table, but the concerns seemed somewhat different to me. Happy to discuss/change this though!

@gtsiolis
Copy link
Contributor

fyi: I've opened a follow up issue with a feature request to surface this automatic pause action better on the dashboard as this could be confusing to users coming back to inactive projects in the dashboard, see #9232. Feedback is welcome!

Cc @jankeromnes @easyCZ @jldec

@jankeromnes jankeromnes force-pushed the jx/inactive-projects branch 2 times, most recently from 1a3b8f3 to 36a1a6f Compare April 11, 2022 18:28
@jankeromnes
Copy link
Contributor Author

jankeromnes commented Apr 11, 2022

/werft run with-clean-slate-deployment

👍 started the job as gitpod-build-jx-inactive-projects.7

@jankeromnes jankeromnes force-pushed the jx/inactive-projects branch from 36a1a6f to 93a960c Compare April 11, 2022 19:04
@jankeromnes jankeromnes force-pushed the jx/inactive-projects branch from 93a960c to 904d37a Compare April 12, 2022 08:04
@jankeromnes jankeromnes force-pushed the jx/inactive-projects branch from 904d37a to 436ceea Compare April 12, 2022 08:50
@jankeromnes
Copy link
Contributor Author

Looks like this now when a project becomes inactive, but still triggers new prebuilds:

Screenshot 2022-04-12 at 11 30 54 Screenshot 2022-04-12 at 11 31 06

@jankeromnes jankeromnes marked this pull request as ready for review April 12, 2022 09:33
@jankeromnes jankeromnes requested a review from a team April 12, 2022 09:33
@github-actions github-actions bot added the team: webapp Issue belongs to the WebApp team label Apr 12, 2022
@PrimaryColumn(TypeORM.UUID_COLUMN_TYPE)
projectId: string;

@Column("varchar")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's historically the reason for using varchar to store timestamps? Is it possible to use a DATETIME?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question -- I don't have a strong opinion here, and was just following established practice 😊

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I ask is that when prototyping an ORM for Golang, Varchar timestamps make it much more complicated - this is further coupled with the somewhat non-standard usage of ISO8601 in javascript to represent a timestamp, which isn't quite RFC3339 compatible.

See https://github.com/gitpod-io/gitpod/pull/8967/files#diff-64ef712fa293d2472e903c3b14ffbaceb31a8b65b79e0f7a31316dbdcbe0c702R18

Definitely not trying to block this PR on this, mostly looking to understand historical reasoning for the varchar choice. Feel free to ignore this comment to land the change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I ask is that when prototyping an ORM for Golang, Varchar timestamps make it much more complicated - this is further coupled with the somewhat non-standard usage of ISO8601 in javascript to represent a timestamp, which isn't quite RFC3339 compatible.

I think your concerns are valid, and there is definitely room for improvement in our DB types. 👍

Definitely not trying to block this PR on this, mostly looking to understand historical reasoning for the varchar choice. Feel free to ignore this comment to land the change.

Thanks! I would love to, but first I need a review. 😇

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still reviewing :)

Copy link
Contributor Author

@jankeromnes jankeromnes Apr 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's awesome, thanks a bunch 🙏

I'll try to practice our new review process by assigning this Pull Request to you then (but please feel free to un-assign yourself again if that didn't make sense!)

@jankeromnes jankeromnes requested review from easyCZ and a team April 12, 2022 09:42
@jankeromnes jankeromnes removed the request for review from easyCZ April 12, 2022 09:43
@easyCZ
Copy link
Member

easyCZ commented Apr 12, 2022

One alternative approach we could consider is to load the d_b_project_usage async from existing tables (Prebuild, WorkspaceInstance) to treat it like more of a cache. Doing this would achieve the following:

  1. The handling code for a project doesn't need to be aware of the usage tracking - this eliminates opportunity for a bug where we forget to add the call to track
  2. It derives the usage from a single place, rather than multiple
  3. It makes the update process async which does not impact our latency on project related calls (the setting of the value).

WDYT?

Copy link
Member

@easyCZ easyCZ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR looks good, and worked for me based on the test specified. Leaving a couple of questions for my own context. Approving with a hold.

/hold

Posting our slack conversation about splitting PRs here as well for visibility.
For the PR, with regards to tiny PRs. I'd split it into the following:

  1. Add a method to call to updateProjectUsage but don't write to the DB, only increment a metric
  2. Create the table, it's unused
    Get data that this works
  3. Wire them together - updateProjectUsage actually writes to the DB

projectId: string;

@Column("varchar")
lastWebhookReceived: string;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this field strictly needed for our first version? We do set it, but we're not using it in read paths.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, it's not strictly needed here indeed. However, since I modified very related code, I thought this small drive-by change could be a good first step toward fixing #7010

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave it up to you.

For surfacing which webhook we've received, I'd expose it as a first class-table with webhooks (and webhook payloads), such that we can show what we've received in the webhook as well (would definitely be useful for our debugging).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I'll leave this as is, and we can decide later whether we want a full-fledged table of webhook events, or whether showing a simple timestamp of the last event is good enough. 🛹

export class ProjectUsage1649667202321 implements MigrationInterface {
public async up(queryRunner: QueryRunner): Promise<void> {
await queryRunner.query(
"CREATE TABLE IF NOT EXISTS `d_b_project_usage` (`projectId` char(36) NOT NULL, `lastWebhookReceived` varchar(255) NOT NULL DEFAULT '', `lastWorkspaceStart` varchar(255) NOT NULL DEFAULT '', `deleted` tinyint(4) NOT NULL DEFAULT '0', `_lastModified` timestamp(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6) ON UPDATE CURRENT_TIMESTAMP(6), PRIMARY KEY (`projectId`), KEY `ind_dbsync` (`_lastModified`)) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's we're setting a limit on projectId of 36 chars, should this also be reflected in the precision: 36 field on the Column of the ORM model?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not too familiar with the intricacies of our ORM model, so I wouldn't know the pros and cons of doing this.

Briefly looking through our code, it seems that we typically use precision for timestamps, but not for IDs. If you think this should be changed, could you please open an issue for this? 🙏

@@ -116,6 +116,15 @@ export class GitHubEnterpriseApp {
): Promise<StartPrebuildResult | undefined> {
const span = TraceContext.startSpan("GitHubEnterpriseApp.handlePushHook", ctx);
try {
const cloneURL = payload.repository.clone_url;
const projectAndOwner = await this.findProjectAndOwner(cloneURL, user);
if (projectAndOwner.project) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the point of this call, how do we know that the project specified in the payload in fact belongs to the correct user? Is it possible to poison our data across projects here?

Mostly lacking context, so trying to understand better how the payload from the webhook is authhorized

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the point of this call, how do we know that the project specified in the payload in fact belongs to the correct user?

That's a good question. I don't know about webhook events from the GitHub App, but for GitLab, we look at the provided (user + project) token in order to make sure that the project specified in the payload actually belongs to the user who installed this webhook:

protected async findUser(ctx: TraceContext, context: GitLabPushHook, req: express.Request): Promise<User> {
const span = TraceContext.startSpan("GitLapApp.findUser", ctx);
try {
const secretToken = req.header("X-Gitlab-Token");
span.setTag("secret-token", secretToken);
if (!secretToken) {
throw new Error("No secretToken provided.");
}
const [userid, tokenValue] = secretToken.split("|");
const user = await this.userDB.findUserById(userid);
if (!user) {
throw new Error("No user found for " + secretToken + " found.");
} else if (!!user.blocked) {
throw new Error(`Blocked user ${user.id} tried to start prebuild.`);
}
const identity = user.identities.find((i) => i.authProviderId === TokenService.GITPOD_AUTH_PROVIDER_ID);
if (!identity) {
throw new Error(`User ${user.id} has no identity for '${TokenService.GITPOD_AUTH_PROVIDER_ID}'.`);
}
const tokens = await this.userDB.findTokensForIdentity(identity);
const token = tokens.find((t) => t.token.value === tokenValue);
if (!token) {
throw new Error(`User ${user.id} has no token with given value.`);
}
if (
token.token.scopes.indexOf(GitlabService.PREBUILD_TOKEN_SCOPE) === -1 ||
token.token.scopes.indexOf(context.repository.git_http_url) === -1
) {
throw new Error(
`The provided token is not valid for the repository ${context.repository.git_http_url}.`,
);
}
return user;
} finally {
span.finish();
}
}

We do something similar for GitHub Enterprise webhooks, where the secret token is not sent as part of the webhook payload, but is used to sign the webhook payload (which we can verify against the user's tokens):

// Verify the webhook signature
const signature = req.header("X-Hub-Signature-256");
const body = (req as any).rawBody;
const tokenEntries = (await this.userDB.findTokensForIdentity(gitpodIdentity)).filter((tokenEntry) => {
return tokenEntry.token.scopes.includes(GitHubService.PREBUILD_TOKEN_SCOPE);
});
const signingToken = tokenEntries.find((tokenEntry) => {
const sig =
"sha256=" +
createHmac("sha256", user.id + "|" + tokenEntry.token.value)
.update(body)
.digest("hex");
return timingSafeEqual(Buffer.from(sig), Buffer.from(signature ?? ""));
});
if (!signingToken) {
throw new Error(`User ${user.id} has no token matching the payload signature.`);
}
return user;

Is it possible to poison our data across projects here?

I don't think it is possible to send forged webhook events to Gitpod.

However, your question does remind me of an issue where Pull Requests can come from forks, but trigger a Prebuild for the main repository, which can be problematic in some cases: https://github.com/gitpod-io/security/issues/26

@jankeromnes
Copy link
Contributor Author

jankeromnes commented Apr 12, 2022

Many thanks for your very helpful review and suggestions @easyCZ! 🙏

One alternative approach we could consider is to load the d_b_project_usage async from existing tables (Prebuild, WorkspaceInstance) to treat it like more of a cache. Doing this would achieve the following:

  1. The handling code for a project doesn't need to be aware of the usage tracking - this eliminates opportunity for a bug where we forget to add the call to track
  2. It derives the usage from a single place, rather than multiple
  3. It makes the update process async which does not impact our latency on project related calls (the setting of the value).

WDYT?

I like the idea of doing these things more async, because it could make our system simpler and much less subtle/magical.

(For example, I initially updated lastWorkspaceStart only when starting from a prebuild, so starting a workspace not from a prebuild wouldn't count -- I've since added the missing update code, but it's pretty easy to imagine a future refactor where someone that could be me accidentally forgets one of these updates again.)

However, it feels like we're lacking some good framework code for such asynchronous jobs that regularly process large amounts of data in the background. I agree that we have to start somewhere, but I'm a bit hesitant to volunteer this Pull Request to be the first "async background processing job", given that it can already provide value as is. 🛹


I've also replied to your questions in-line. Many thanks for diving deeper into all these systems! I find your questions and instincts for improvements quite relevant, and am keen to help improve things or provide context whever I can. 🙂

Also, removing the hold, as I think this is good to go as is. Happy to improve anything in follow-ups. 🚀

/unhold

@roboquat roboquat merged commit ed30d96 into main Apr 12, 2022
@roboquat roboquat deleted the jx/inactive-projects branch April 12, 2022 17:07
@roboquat roboquat added deployed: webapp Meta team change is running in production deployed Change is completely running in production labels Apr 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
deployed: webapp Meta team change is running in production deployed Change is completely running in production release-note size/L team: webapp Issue belongs to the WebApp team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[prebuilds] Automatically disable prebuilds for projects without active Workspace usage
4 participants