Refactor creation logic of ingress/routes into RayCluster Controller #493
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
991d9eb to 6835c21 (Compare)
02e9c95 to 1d9e1fc (Compare)
/hold Should be merged with project-codeflare/codeflare-sdk#495
@@ -97,6 +101,10 @@ func (r *RayClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request)
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	isLocalInteractive := annotationBoolVal(logger, &cluster, "sdk.codeflare.dev/local_interactive")
	isOpenShift, ingressHost := getClusterType(logger, r.kubeClient, &cluster)
	ingressDomain := getIngressDomain(&cluster)
Since you only care about the presence of the ingress domain annotation, this can just be simplified to:
_, ingressDomainExists := cluster.ObjectMeta.Annotations["sdk.codeflare.dev/ingress_domain"]
and the conditionals below edited accordingly.
I see why that can be inferred; that was wrong on my part. I do require the actual value of the ingress domain annotation too, as it's used for the Host field when creating ingresses and the RayClient ingress/route.
I was originally declaring a new variable for the ingress domain on each creation of ingresses, which is why it looked like I only needed its presence. Great catch, thanks. Changes made.
pkg/controllers/support.go
Outdated
} else {
	logger.Info("Cannot retrieve config, assuming we're on Vanilla Kubernetes")
	return false, fmt.Sprintf("ray-dashboard-%s-%s.%s", cluster.Name, cluster.Namespace, ingress_domain)
}
Can you change this to:
if err != nil || dclient == nil {
logger.Info("Cannot retrieve config, assuming we're on Vanilla Kubernetes")
return false, fmt.Sprintf("ray-dashboard-%s-%s.%s", cluster.Name, cluster.Namespace, ingress_domain)
}
and then leave the majority of the body outside of conditionals
#NeverNester https://www.youtube.com/watch?v=CFRhGnuXG-4
Oops wrong conditional, but same with this one:
if err != nil || config == nil {
...
return ...
}
Wow Kevin, that video was so enlightening! Thanks for saving me early; I will 100% follow the path of a Never Nester. Hopefully this next commit shows that :) Thanks a lot for the video and advice.
pkg/controllers/support.go
Outdated
	return cluster.Name + "-head-svc"
}

func createRayClientRoute(cluster *rayv1.RayCluster) *routeapply.RouteApplyConfiguration {
I think create is a misnomer here. Can you use the same verbiage used for the other resources (desired)?
Thanks for the review. I added a commit per comment - Will squash them afterwards.
ccc3315 to 7071fd3 (Compare)
@@ -60,6 +60,7 @@ jobs:
	run: |
		echo Deploying CodeFlare operator
		IMG="${REGISTRY_ADDRESS}"/codeflare-operator
		sed -i 's/RayDashboardOAuthEnabled: pointer.Bool(true)/RayDashboardOAuthEnabled: pointer.Bool(false)/' main.go
That should be done in the e2e ConfigMap instead.
@@ -184,8 +184,8 @@ func main() {
	}

	v, err := HasAPIResourceForGVK(kubeClient.DiscoveryClient, rayv1.GroupVersion.WithKind("RayCluster"))
	if v && *cfg.KubeRay.RayDashboardOAuthEnabled {
	rayClusterController := controllers.RayClusterReconciler{Client: mgr.GetClient(), Scheme: mgr.GetScheme()}
	if v {
nit: Let's rename that v to ok :)
@@ -50,15 +52,17 @@ type RayClusterReconciler struct {
	routeClient *routev1client.RouteV1Client
	Scheme      *runtime.Scheme
	CookieSalt  string
	Config      *config.CodeFlareOperatorConfiguration
Maybe only the RayCluster controller configuration could be enough here?
@@ -97,6 +101,10 @@ func (r *RayClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request)
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	isLocalInteractive := annotationBoolVal(ctx, &cluster, "sdk.codeflare.dev/local_interactive", false)
	ingressDomain := cluster.ObjectMeta.Annotations["sdk.codeflare.dev/ingress_domain"]
	isOpenShift, ingressHost := getClusterType(ctx, r.kubeClient, &cluster, ingressDomain)
Checking the cluster type should only be done once when the operator starts. Calling the Discovery API for each reconciliation is also not really acceptable.
		return true, ""
	}
}
onKind, _ := isOnKindCluster(clientset)
That really feels like testing concerns leaking in application code. Why is it needed to explicitly check KinD?
You are right, we don't explicitly need to check for KinD, anything that is not OpenShift could suit. The only reason that we check for KinD is for our own testing purposes. If it's true, then the ingress Host will use "kind". We could make it more generic by supplying the ingress_domain to the e2e tests, then there is no need for checking for KinD explicitly. Should I change or leave as is for now... WDYT?
if err != nil {
	logger.Error(err, "Failed to update OAuth Route")
}
if cluster.Status.State != "suspended" && r.isRayDashboardOAuthEnabled() && isOpenShift {
It's possible the cluster transitions from running to suspended state. Should the resources be removed in that case?
isLocalInteractive := annotationBoolVal(ctx, &cluster, "sdk.codeflare.dev/local_interactive", false)
ingressDomain := cluster.ObjectMeta.Annotations["sdk.codeflare.dev/ingress_domain"]
What does justify we let clients / users decide for these? Shouldn't that be the responsibility of the platform admin to configure?
Those are the annotations that are created when local_interactive is true or the user is using an ingress domain.
The local_interactive one is used so the RayCluster Controller (RCC) knows to create the Ray Client route/ingress.
And the ingress domain one is needed for Kubernetes clusters right?
7071fd3 to a62b141 (Compare)
/lgtm
Just tested it out: it creates the OAuth objects successfully and does not when the RayCluster is suspended.
Also tested suspended -> admitted, working as expected.
I'm still missing test coverage being added to https://github.com/project-codeflare/codeflare-operator/blob/faff28ab80941db913f509ffdc22eb9cffbed7c8/pkg/controllers/raycluster_controller_test.go
Issue link
Issue: https://issues.redhat.com/browse/RHOAIENG-1056
What changes have been made
- createRayClientRoute
- createRayClientIngress
- createIngressApplyConfiguration
- isOnKindCluster
- getClusterType
- annotationBoolVal
- getIngressDomain
Verification steps
This PR should be tested alongside this PR in the SDK, which removes the ingress/routes creation logic: project-codeflare/codeflare-sdk#495

To test these changes:

In an OpenShift cluster:
1. Run make run NAMESPACE=default and make install.
2. Install Kueue v1.1.0 and create a LocalQueue and ClusterQueue.
3. Set mcad to false.
4. Set write_to_file to true to have access to the RayCluster yaml, and add the Kueue label kueue.x-k8s.io/queue-name: <name of localqueue>.
5. Add ingress_domain to the ClusterConfiguration, i.e. apps.rosa.clustername.k1pm.p3.openshiftapps.com.
6. The RayCluster can be in the suspended state provided by Kueue: if there are not enough resources, the RayCluster controller will hold the creation of the routes/ingresses.
7. ray.get(ref) should result in = 1789.4644.....

In a KinD cluster:
1. Add 127.0.0.1 kind as per documentation.
2. The hostname resolves to kind, unless you are running an un-named KinD cluster, whose hostname will resolve to kind-control-plane by default; in this scenario, ingress_domain is not required.
3. ray.get(ref) should result in = 1789.4644.....

Checks