
Refactor creation logic of ingress/routes into RayCluster Controller #493


Merged

Conversation


@ChristianZaccaria ChristianZaccaria commented Mar 21, 2024

Issue link

Issue: https://issues.redhat.com/browse/RHOAIENG-1056

What changes have been made

  • Refactored the ingress/route creation and TLS support into the RayCluster Controller in the CFO.
  • Added creation of the rayclient route on OpenShift, and of the rayclient ingress and dashboard ingresses for KinD/vanilla Kubernetes clusters.
  • The RayCluster Controller now always runs alongside the CFO.
  • Adjusted the e2e tests to disable OAuth in the CFO config; this config bool tells the RayCluster Controller whether to create a secure route and OAuth resources on OpenShift, or an ingress for KinD clusters (for e2e tests).
  • Created a support package with several functions, including (a minimal illustrative sketch of annotationBoolVal follows this list):
    • createRayClientRoute
    • createRayClientIngress
    • createIngressApplyConfiguration
    • isOnKindCluster
    • getClusterType
    • annotationBoolVal
    • getIngressDomain
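
For illustration only, here is a minimal sketch of what an annotation-to-bool helper such as annotationBoolVal could look like. The signature evolved during review (a later revision takes a ctx and a default value), so the package name, parameters, and logging shown here are assumptions rather than the PR's final implementation:

package support

import (
    "strconv"

    "github.com/go-logr/logr"
    rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// annotationBoolVal returns the boolean value of the given RayCluster annotation,
// falling back to defaultValue when the annotation is missing or cannot be parsed.
func annotationBoolVal(logger logr.Logger, cluster *rayv1.RayCluster, annotation string, defaultValue bool) bool {
    value, exists := cluster.ObjectMeta.Annotations[annotation]
    if !exists || value == "" {
        return defaultValue
    }
    result, err := strconv.ParseBool(value)
    if err != nil {
        logger.Error(err, "Could not convert annotation value to bool", "annotation", annotation)
        return defaultValue
    }
    return result
}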

Verification steps

This PR should be tested alongside the SDK PR that removes the ingress/route creation logic from the SDK: project-codeflare/codeflare-sdk#495
To test these changes:
In an OpenShift cluster:

  1. Check out this PR, then run make run NAMESPACE=default and make install.
  2. Deploy and install KubeRay v1.1.0.
  3. Check out and install the SDK from its PR: Remove ingress/routes logic from SDK codeflare-sdk#495
  4. Deploy Kueue and create a LocalQueue and a ClusterQueue.
  5. Run through this notebook.
  6. Switch mcad to false.
  7. Set write_to_file to true to have access to the RayCluster YAML, and add the Kueue label kueue.x-k8s.io/queue-name: <name of localqueue>.
  8. Add an ingress_domain to the ClusterConfiguration, e.g., apps.rosa.clustername.k1pm.p3.openshiftapps.com.
  9. Once cluster.up() is run, the SDK creates the oauth-proxy sidecar by default if on an OpenShift cluster, and the RayCluster Controller creates all the necessary OAuth resources plus the rayclient route, provided the RayCluster is not in a suspended state set by Kueue. If there are not enough resources, the RayCluster Controller holds off on creating the routes/ingresses.
  10. After going through the notebook, the route should be accessible through authentication, and running ray.get(ref) should return 1789.4644....

In a KinD cluster:

  1. For the local_interactive notebook, follow the same steps as above, except that no OAuth resources are created, including the oauth-proxy sidecar container. The ingress and the RayClient ingress will be created by the RayCluster Controller.
  • Add 127.0.0.1 kind to your local /etc/hosts as per the documentation.
  • For the ingress_domain, use kind, unless you are running an unnamed KinD cluster, whose hostname resolves to kind-control-plane by default; in that scenario, ingress_domain is not required.
  • After going through the notebook, the ingress should be accessible, and running ray.get(ref) should return 1789.4644....
  2. For a basic notebook, the same steps apply, except that the RayClient ingress is not created this time; only the ingress to the service is created.

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change


openshift-ci bot commented Mar 21, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign astefanutti for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ChristianZaccaria ChristianZaccaria changed the title WIP - Creation/Deletion of ingress and routes in RayCluster Controller WIP - creation and deletion of ingress/routes logic in RayCluster Controller Mar 21, 2024
@ChristianZaccaria ChristianZaccaria changed the title WIP - creation and deletion of ingress/routes logic in RayCluster Controller WIP - refactor creation logic of ingress/routes into RayCluster Controller Mar 21, 2024
@ChristianZaccaria ChristianZaccaria force-pushed the rc-controller-kueue branch 3 times, most recently from 991d9eb to 6835c21 Compare March 28, 2024 21:35
@ChristianZaccaria ChristianZaccaria force-pushed the rc-controller-kueue branch 5 times, most recently from 02e9c95 to 1d9e1fc Compare April 3, 2024 18:09
@ChristianZaccaria ChristianZaccaria changed the title WIP - refactor creation logic of ingress/routes into RayCluster Controller Refactor creation logic of ingress/routes into RayCluster Controller Apr 4, 2024
@ChristianZaccaria
Contributor Author

/hold Should be merged with project-codeflare/codeflare-sdk#495

@@ -97,6 +101,10 @@ func (r *RayClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request)
return ctrl.Result{}, client.IgnoreNotFound(err)
}

isLocalInteractive := annotationBoolVal(logger, &cluster, "sdk.codeflare.dev/local_interactive")
isOpenShift, ingressHost := getClusterType(logger, r.kubeClient, &cluster)
ingressDomain := getIngressDomain(&cluster)
Collaborator

Since you only care about the presence of the ingress domain

This can just be simplified to:

_, ingressDomainExists := cluster.ObjectMeta.Annotations["sdk.codeflare.dev/ingress_domain"]

and edit the conditionals below

Contributor Author

I see why that can be inferred; that was wrong on my part. I do require the actual value of the ingress domain annotation too, as it's used for the Host field when creating ingresses and the RayClient ingress/route.

I was originally declaring a new variable for the ingress domain on each creation of ingresses, which is why it looked like I only needed its presence. Great catch, thanks. Changes made.
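
(For reference, a single comma-ok lookup, in the spirit of the snippet suggested above, yields both the presence and the actual value; the variable names are illustrative:)

// One lookup gives both the ingress domain value (needed for the Host fields)
// and whether the annotation is set at all.
ingressDomain, ingressDomainExists := cluster.ObjectMeta.Annotations["sdk.codeflare.dev/ingress_domain"]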

Comment on lines 163 to 157
} else {
logger.Info("Cannot retrieve config, assuming we're on Vanilla Kubernetes")
return false, fmt.Sprintf("ray-dashboard-%s-%s.%s", cluster.Name, cluster.Namespace, ingress_domain)
}
Collaborator

@KPostOffice KPostOffice Apr 4, 2024

Can you change this to:

if err != nil || dclient == nil {
    logger.Info("Cannot retrieve config, assuming we're on Vanilla Kubernetes")
    return false, fmt.Sprintf("ray-dashboard-%s-%s.%s", cluster.Name, cluster.Namespace, ingress_domain)
}

and then leave the majority of the body outside of conditionals

#NeverNester https://www.youtube.com/watch?v=CFRhGnuXG-4

Collaborator

Oops wrong conditional, but same with this one:

if err != nil || config == nil {
    ...
    return ...
}

Contributor Author

Wow Kevin, that video was so enlightening! Thanks for catching me early; I will 100% follow the path of a Never Nester. Hopefully this next commit shows that :) - Thanks a lot for the video and the advice.

return cluster.Name + "-head-svc"
}

func createRayClientRoute(cluster *rayv1.RayCluster) *routeapply.RouteApplyConfiguration {
Collaborator

I think create is a misnomer here. Can you use the same verbiage used for the other resources (desired)?

Contributor Author

@ChristianZaccaria ChristianZaccaria left a comment

Thanks for the review. I added a commit per comment - Will squash them afterwards.

@@ -60,6 +60,7 @@ jobs:
run: |
echo Deploying CodeFlare operator
IMG="${REGISTRY_ADDRESS}"/codeflare-operator
sed -i 's/RayDashboardOAuthEnabled: pointer.Bool(true)/RayDashboardOAuthEnabled: pointer.Bool(false)/' main.go
Contributor

That should be done in the e2e ConfigMap instead.

@@ -184,8 +184,8 @@ func main() {
}

v, err := HasAPIResourceForGVK(kubeClient.DiscoveryClient, rayv1.GroupVersion.WithKind("RayCluster"))
if v && *cfg.KubeRay.RayDashboardOAuthEnabled {
rayClusterController := controllers.RayClusterReconciler{Client: mgr.GetClient(), Scheme: mgr.GetScheme()}
if v {
Contributor

nit: Let's rename that v to ok :)

@@ -50,15 +52,17 @@ type RayClusterReconciler struct {
routeClient *routev1client.RouteV1Client
Scheme *runtime.Scheme
CookieSalt string
Config *config.CodeFlareOperatorConfiguration
Contributor

Maybe only the RayCluster controller configuration could be enough here?
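
A rough sketch of what that narrowing could look like; the KubeRayConfiguration type name is assumed from the cfg.KubeRay usage elsewhere in this PR and may not match the final code:

type RayClusterReconciler struct {
    // ...other fields as in the PR (client, kubeClient, etc.)...
    routeClient *routev1client.RouteV1Client
    Scheme      *runtime.Scheme
    CookieSalt  string
    // Narrowed from *config.CodeFlareOperatorConfiguration to just the
    // RayCluster/KubeRay portion of the operator configuration.
    Config *config.KubeRayConfiguration
}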

@@ -97,6 +101,10 @@ func (r *RayClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request)
return ctrl.Result{}, client.IgnoreNotFound(err)
}

isLocalInteractive := annotationBoolVal(ctx, &cluster, "sdk.codeflare.dev/local_interactive", false)
ingressDomain := cluster.ObjectMeta.Annotations["sdk.codeflare.dev/ingress_domain"]
isOpenShift, ingressHost := getClusterType(ctx, r.kubeClient, &cluster, ingressDomain)
Contributor

Checking the cluster type should only be done once when the operator starts. Calling the Discovery API for each reconciliation is also not really acceptable.
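
A minimal sketch of the suggested approach, assuming a hypothetical isOpenShift helper that is called once from main() and whose result is passed into the reconciler; the names and the route.openshift.io heuristic are illustrative, not the PR's final API:

import "k8s.io/client-go/discovery"

// Hypothetical one-time detection at operator startup, so the reconciler does not
// have to call the Discovery API on every reconciliation.
func isOpenShift(dc discovery.DiscoveryInterface) bool {
    apiGroupList, err := dc.ServerGroups()
    if err != nil {
        return false
    }
    for _, group := range apiGroupList.Groups {
        if group.Name == "route.openshift.io" {
            return true
        }
    }
    return false
}

// In main(), before registering the controller (illustrative field name):
//   rayClusterController := controllers.RayClusterReconciler{
//       Client:      mgr.GetClient(),
//       Scheme:      mgr.GetScheme(),
//       IsOpenShift: isOpenShift(kubeClient.DiscoveryClient),
//   }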

return true, ""
}
}
onKind, _ := isOnKindCluster(clientset)
Contributor

That really feels like testing concerns leaking in application code. Why is it needed to explicitly check KinD?

Contributor Author

You are right, we don't explicitly need to check for KinD; anything that is not OpenShift would suit. The only reason we check for KinD is for our own testing purposes: if it's true, the ingress Host uses "kind". We could make it more generic by supplying the ingress_domain to the e2e tests, so there would be no need to check for KinD explicitly. Should I change it or leave it as is for now? WDYT?

if err != nil {
logger.Error(err, "Failed to update OAuth Route")
}
if cluster.Status.State != "suspended" && r.isRayDashboardOAuthEnabled() && isOpenShift {
Contributor

It's possible the cluster transitions from running to suspended state. Should the resources be removed in that case?

Comment on lines 104 to 105
isLocalInteractive := annotationBoolVal(ctx, &cluster, "sdk.codeflare.dev/local_interactive", false)
ingressDomain := cluster.ObjectMeta.Annotations["sdk.codeflare.dev/ingress_domain"]
Contributor

What justifies letting clients/users decide these? Shouldn't that be the responsibility of the platform admin to configure?

Contributor Author

Those are the annotations that are created when local_interactive is true or the user is using an ingress domain.
The local_interactive one is used for the RCC to know whether to create the Ray Client resources.
And the ingress domain one is needed for Kubernetes clusters, right?

Contributor

@Bobbins228 Bobbins228 left a comment

/lgtm
Just tested it out: it creates the OAuth objects successfully, and does not when the RC is suspended.
Also tested suspended -> admitted, working as expected.

@openshift-ci openshift-ci bot added the lgtm label Apr 5, 2024
@dimakis dimakis requested a review from astefanutti April 5, 2024 15:18
@dimakis dimakis merged commit 465da20 into project-codeflare:main Apr 5, 2024
@sutaakar
Contributor

sutaakar commented Apr 8, 2024
