Add installer tests for GKE and k3s #10365
Conversation
Force-pushed from 0a078ae to b8a6f31
As far as I can tell, it looks good so far. However, I would like to get a second pair of eyes for this from the platform team. @meysholdt 👀
name: gitpod-issuer
spec:
  acme:
    email: [email protected]
As far as I know, you can simply remove the email line. This is only needed to notify when a cert expires (which is not relevant for test setups).
Whoops! You are right, I totally missed cleaning that up.
If you need an email address for internal use as a sink for all emails from LetsEncrypt, we already use [email protected] for that. That being said, if it's possible to not specify an email at all, then that is the better solution.
Turns out, I could just leave out the `email` key!
Force-pushed from b8a6f31 to 5832150
install/infra/terraform/gke/main.tf
Outdated
# We can't create a cluster with no node pool defined, but we want to only use
# separately managed node pools. So we create the smallest possible default
# node pool and immediately delete it.
remove_default_node_pool = true
Deleting the node pool and creating another node pool can make "terraform apply" ~10min slower.
Proposed solution: Instead of deleting the default node pool you configure it to be the "service" node pool. Will this work?
My understanding is that the default node pool uses the default GCP service account, which grants broad permissions. For testing purposes, reusing this node pool is acceptable since it would be extremely hard to exploit, so I think running services on the default node pool is reasonable. For production clusters it may be worth taking on the additional cost of spinning up specialized node pools and wiping the default pool.
This is a great point. Re-using the default node pool as the services pool is something I thought of too, but I couldn't see an option, at least in the container cluster Terraform resource, to set labels only on the default node pool (the labels being `meta` and `ide`). If I set the `label` field, it would apply the same labels to every node pool. For testing purposes I imagine this is probably okay, but I'm not sure it's the best course of action.
Turns out, I couldn't have been more wrong. That option only applied the labels to the default node. Making this change! Thanks for the suggestion, Mo!
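For the record, a minimal sketch of what labeling the default pool looks like, assuming the stock `google_container_cluster` arguments (the location, machine type, and label keys below are illustrative, not taken from this PR):

```hcl
resource "google_container_cluster" "gitpod-cluster" {
  name     = var.name
  location = "europe-west1-b" # illustrative; a zonal cluster

  # Keep the default node pool and use it as the services pool,
  # instead of setting remove_default_node_pool = true.
  initial_node_count = 1

  node_config {
    machine_type = "n2d-standard-4" # illustrative
    image_type   = "UBUNTU_CONTAINERD"

    # node_config on google_container_cluster applies only to the default
    # node pool, not to separately managed google_container_node_pool resources.
    labels = {
      "gitpod.io/workload_meta" = "true"
      "gitpod.io/workload_ide"  = "true"
    }
  }
}
```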
install/infra/terraform/gke/main.tf
Outdated
preemptible = var.pre-emptible
image_type  = "UBUNTU_CONTAINERD"
disk_type   = "pd-ssd"
A standard HDD is cheaper and should be sufficient.
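In the node pool's `node_config` that would be roughly the following (the pool resource name is an assumption):

```hcl
resource "google_container_node_pool" "pool" {
  name    = "pool"
  cluster = google_container_cluster.gitpod-cluster.id

  node_config {
    preemptible = var.pre-emptible
    image_type  = "UBUNTU_CONTAINERD"
    disk_type   = "pd-standard" # standard persistent disk, cheaper than "pd-ssd"
  }
}
```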
install/infra/terraform/gke/main.tf
Outdated
}

preemptible = var.pre-emptible
image_type  = "UBUNTU_CONTAINERD"
The COS image is optimised for performance and security. Since we don't run workspaces (which are not compatible with COS) on this node pool, I think we should use COS nodes for the services pool.
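i.e. something along these lines for the services pool (the resource name is an assumption on my part):

```hcl
resource "google_container_node_pool" "services" {
  name    = "services"
  cluster = google_container_cluster.gitpod-cluster.id

  node_config {
    # COS is optimised for performance and security; it is fine here because
    # workspaces (which require Ubuntu) run on a separate node pool.
    image_type = "COS_CONTAINERD"
  }
}
```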
install/infra/terraform/gke/main.tf
Outdated
management {
  auto_repair  = true
  auto_upgrade = false
Your intention to enable preemptible nodes suggests you're confident services will robustly handle restarts.
If that's true, it should be fine to enable auto_upgrade as well.
@meysholdt I have to be honest, I am not super confident about how fault-tolerant the Gitpod setup is. In a way, I was basically trying to translate the current gitpod-gke-guide configuration to Terraform pretty much one-to-one, hence this disparity. I will make `preemptible` non-configurable for now and will iterate on this if Gitpod proves fault-tolerant enough, considering preemptible VMs are cheaper.
I am pretty sure prebuilds are not fault tolerant yet. If they were, we'd be running them on spot VMs now.
variable "zone" { | ||
type = string | ||
description = "The zone to create the cluster." | ||
default = "b" |
Can you document how to make the cluster regional (high-availability k8s master) vs. zonal?
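For reference, with the stock `google_container_cluster` resource the choice is driven by `location`; a hedged sketch (resource names and values here are illustrative):

```hcl
# Zonal cluster: control plane in a single zone, cheaper, fine for tests.
resource "google_container_cluster" "zonal" {
  name               = "gitpod-test-zonal"
  location           = "europe-west1-b" # a zone
  initial_node_count = 1
}

# Regional cluster: control plane replicated across the region's zones
# (high-availability Kubernetes master).
resource "google_container_cluster" "regional" {
  name               = "gitpod-test-regional"
  location           = "europe-west1" # a region
  initial_node_count = 1
}
```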
default = "1.21.11-gke.1100" | ||
} | ||
|
||
variable "name" { |
Why make this configurable? Do you want to support more than one cluster per GCP project?
I see the possibility of parallel CI jobs creating parallel clusters; parameterizing the cluster name is cheap to do now but can be a bit painful to retrofit down the road, so I'm in favor of this approach.
(Side note: if we have multiple parallel CI jobs then the terraform state bucket will need to have a prefix per run, but this is a yak that we can shave another day.)
Like @adrienthebo suggested, the main reason for a configurable `name` parameter is parallel CI runs (quite likely, since we are planning to test multiple configs). I was hoping that having a default name reduces the hassle a bit.
Regarding the state, @adrienthebo, this is a great point. The way I have worked around it currently is by using Terraform workspaces in the Makefile, since I couldn't set the `prefix` using variables. Each job runs in its own Terraform workspace, which means the state file will be named `<workspace-name>.tfstate`.
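For reference, a sketch of how this plays out with a GCS backend (the bucket and prefix names below are made up); with the `gcs` backend, each workspace's state is stored at `<prefix>/<workspace>.tfstate` in the bucket:

```hcl
terraform {
  backend "gcs" {
    bucket = "sh-automated-tests-tf-state" # hypothetical bucket name
    prefix = "installer-tests"             # backend blocks cannot reference variables
  }
}

# Selected per CI job, e.g. from the Makefile:
#   terraform workspace new gke-run-1 || terraform workspace select gke-run-1
#   terraform apply
# -> state is stored as installer-tests/gke-run-1.tfstate
```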
}

variable "min_count" {
  type = number
Why make this configurable?
  default = 50
}

variable "disk_size_gb" {
Why make this configurable? I think the disk size can be decided statically based on what our services need.
The assumption was that if a user decides to use more in-cluster storage, they will need larger node disks, so keeping it configurable made sense at the time.
  default = 100
}

variable "initial_node_count" {
Why make this configurable?
project = "sh-automated-tests"
region  = "europe-west1"
zone    = "b"
Is this file helpful? I'm asking because `variables.tf` already has defaults, and for the sake of simplicity it would be nice to keep the number of files low.
Good point. Removed.
gcp_project = "sh-automated-tests"
gcp_region  = "europe-west1"
gcp_zone    = "europe-west1-b"
Is this file helpful? I'm asking because `variables.tf` already has defaults, and for the sake of simplicity it would be nice to keep the number of files low.
👋 this is awesome to see! 🙏 Thank you so much. @corneliusludmann What is the plan to help teams own exclusive testing for their domains of expertise? For example, I do not see workspace sub-folders, or updates to the codeowners file for this PR. Is that something that we'd layer on later, in a follow-on PR? I see some workspace things now, and if we are to own them later, I would prefer that the workspace team is looped in as an explicit approver so we can have the related dialog.
  }
}

resource "google_container_cluster" "gitpod-cluster" {
Noting this for information sharing, no action needed.
This module handles a number of cases that are great for a production cluster (and thus would be great for a reference architecture module), but may be omitted during testing. Features like cluster autoscaling or regional clusters will provide better performance and resilience for production traffic but for everyday system testing we shouldn't need a lot of nodes, autoscaling, or resilience in the case of a failed GCP zone.
Having a single, unified module for system tests (automated and manual) as well as reference architectures will be fantastic, and this will be great for that - so I don't think we need to pull out any features. However keeping an eye towards the overall cost of test clusters gives us an advantage; if we can drive down the cost of a single test run then we can run tests more cheaply and more frequently.
variable "zone" { | ||
type = string | ||
description = "The zone to create the cluster." | ||
default = "b" |
I'm not sure this default will behave as expected; my understanding of the google provider block (rooted in the google provider circa 2019) is that the zone field is the fully qualified zone name, e.g. `europe-west1-b`.
My understanding may be out of date though, so we'll need to check if this works as expected.
This is true, @adrienthebo. As you can see here, I am appending this value (e.g. `b`) to `var.region` (e.g. `europe-west1`). In retrospect, I understand this is particularly hacky. Will fix. 👍
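The fix could look roughly like this, keeping the variable name `zone` but making it a fully qualified zone (the `project`/`region` variable names are assumptions):

```hcl
variable "zone" {
  type        = string
  description = "Fully qualified zone to create the cluster in, e.g. europe-west1-b."
  default     = "europe-west1-b"
}

provider "google" {
  project = var.project
  region  = var.region
  zone    = var.zone # no "${var.region}-${var.zone}" concatenation needed
}
```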
variable "max_count" { | ||
type = number | ||
description = "Maximum number of nodes in the NodePool. Must be >= min_node_count." | ||
default = 50 |
Small comment: a default max_count of 50 for Gitpod services seems pretty high, especially for tests. If we have a reasonable estimate of the number of nodes spun up during an average test run then we may be able to constrain this lower.
In addition, it may be worthwhile to have separate max_count values for the services and workspaces node pools.
Our currently suggested `max_count` value in the guides is `100`; I wanted to bring it down a bit for the default while keeping it configurable. We can set this value in the `main.tf` used for tests. I believe it is safe to assume that `install/infra/terraform/` will hold the more production-ready code while `install/tests` holds test code.
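If we do split the limits later, per-pool maximums could look roughly like this (variable names and defaults are illustrative, not from this PR):

```hcl
variable "services_max_count" {
  type    = number
  default = 10 # services rarely need many nodes during a test run
}

variable "workspaces_max_count" {
  type    = number
  default = 50
}

resource "google_container_node_pool" "services" {
  name    = "services"
  cluster = google_container_cluster.gitpod-cluster.id

  autoscaling {
    min_node_count = 1
    max_node_count = var.services_max_count
  }
}

resource "google_container_node_pool" "workspaces" {
  name    = "workspaces"
  cluster = google_container_cluster.gitpod-cluster.id

  autoscaling {
    min_node_count = 1
    max_node_count = var.workspaces_max_count
  }
}
```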
Hello!
I'm dropping in to review this to get a feel for how y'all perform code reviews.
I'd like to start by noting that I don't yet know the culture or review practices for Gitpod; sincere apologies if I'm stepping on toes, dropping too many comments, or focusing on topics that might be out of scope for this PR. In addition I'm approaching this PR from the perspective of my role as a GCP consultant, which is an outside perspective driven by GCP best practices but isn't yet informed about Gitpod's standard GCP operations. I hope to share my experience in this domain and offer useful insights in constructive and effective ways, and getting this down will take a few review cycles. Thanks for bearing with me!
My overall perspective is that this is a rock solid foundation for system testing the Gitpod installer for self hosting purposes. I've been talking with @mrzarquon regarding reference architectures and automation for self hosted customers and this is the type of module that could be the basis for that reference, which is really awesome.
In terms of correctness, I think I saw one place (the `zone` variable) that needed a tweak to behave as expected, but it might be using a feature (the full zone name being derived from region + zone) that I'm not familiar with. 👍
Summarizing some of my inline comments, there's some functionality in this module that's well suited for a reference architecture but may benefit from adjustments for running in a CI system. If our installer systems tests are cheap and fast to run then we can run them more frequently, and lightweight test environments will allow upstream tests to validate changes impacting the installation process earlier.
All in all, I think this is really good and look forward to taking this for a spin!
  default = 1
}

variable "pre-emptible" {
Tiny style concern - I think that hyphens are allowed in variable names but underscores are preferred (or omitting the separator entirely in this case).
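i.e. something like this (dropping the separator entirely, as suggested):

```hcl
variable "preemptible" {
  type        = bool
  description = "Whether the node pools should use preemptible VMs."
  default     = false
}

# referenced elsewhere as: preemptible = var.preemptible
```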
@kylos101 Yes, we need more iterations to add component-specific tests. In general, I have tried to move the current guides that we have for self-hosted setup on
@nandajavarma that is super helpful to know, thank you! So this PR is more about provisioning infrastructure and installing Gitpod, and then running our existing tests? Is it safe to say that we'll see future iterations of this that use different reference architectures to provision infrastructure and install Gitpod?
@kylos101 yes, correct! We will add automated setup of different reference architectures (e.g. the WIP AKS one here), including all available configurations (e.g. self-signed certs or otherwise), to test against in following iterations.
Force-pushed from f042e5c to 5d02eed
Force-pushed from 46ff721 to 562285c
Let's ship this as a first iteration and see how well it works. Good job!
Description
This PR adds the first draft of automated installer tests for `GKE` and `k3s`. As of this PR, the testing pipeline consists of the following pieces:
- An `installer-tests.ts` script with defined phases (each pointing to a `target` in a `Makefile` that performs an operation like creating the cluster, installing Gitpod, etc.). A test configuration is basically a combination of multiple phases. This allows easy re-use of code and testing against multiple configurations.
- A `Makefile` that has the targets mentioned above.
- `YAML` files that define a cron job which initiates one of the test configurations. There are two added with this PR: 1) GKE and 2) k3s.

The rest is basically the configuration necessary for the `make` targets. This includes:
- The 1) `gke` and 2) `k3s` Terraform modules.
- Manifests like `cluster-issuer` or `kots-config`, all present in the `manifests` folder.

The `terraform` modules can be used individually to set up the cluster; usage is documented in the corresponding READMEs:
- `gke`
- `k3s`

Related Issue(s)
Fixes #7316

How to test
To create a `GKE` cluster, set up `Gitpod`, and run the existing tests on it, run:

Similarly, to run on a `k3s` cluster:

Release Notes
Documentation