UBI9 Ray container image #177

Merged: 9 commits merged into main from ray-image on Jul 2, 2024

@astefanutti astefanutti commented Jun 26, 2024

Description

Provide a base Ray container image for Distributed Workloads that includes the following layers and components:

  • UBI 9
  • Python 3.9
  • CUDA 12.1
  • Ray 2.23.0

Document how to build a custom container image on top of that base Ray container image.

See https://issues.redhat.com/browse/RHOAIENG-7846 and https://issues.redhat.com/browse/RHOAIENG-7875.
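
As a rough illustration of building a custom image on top of the base image, the sketch below layers extra Python dependencies on it. It assumes the base image is published as quay.io/rhoai/ray:2.23.0-py39-cu121 (the tag mentioned in the review discussion) and that its Python environment is writable by the image's default user. The choice of torch==2.3.1 mirrors the PyTorch version used in the test run; it is an example, not the procedure documented by this PR.

```dockerfile
# Sketch only: the base image tag is taken from the review discussion.
FROM quay.io/rhoai/ray:2.23.0-py39-cu121

# Install additional Python dependencies on top of the base Ray image.
# torch==2.3.1 matches the version used in the test run; adjust as needed.
# Assumption: the default user of the base image can write to the Python
# environment, so no switch to root is required for pip.
RUN pip install --no-cache-dir torch==2.3.1
```

The resulting image can then be referenced wherever the Ray cluster expects a container image, in place of the base image.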

How Has This Been Tested?

This image has been successfully tested with the ray-finetune-llm-deepspeed example: a custom image containing PyTorch 2.3.1 was built on top of it and used to fine-tune a Llama 7B model over several epochs (a run of about 10 hours):

image

The Ray dashboard works as expected, and GPU utilisation is nominal:

image

'ray[all]==2.23.0'

# Restore user workspace
USER 1001

Contributor Author

I still wonder whether we should stick to the ray user that is set in the upstream images.

Contributor

I am wondering whether keeping the ray user makes any difference from the user's point of view.
I vaguely remember that someone had an issue accessing some part of the Ray dashboard (I know, the description is not very specific), though I cannot reproduce it with quay.io/rhoai/ray:2.23.0-py39-cu121.

Contributor

@sutaakar I might be wrong, but are you referring to this issue? https://issues.redhat.com/browse/RHOAIENG-5493

Contributor Author

RHOAIENG-5493 is caused by security constraints rather than by the user.

Actually, the ray user forces us to lower the security constraints, because the Pod can then no longer run with an arbitrary user. From the tests we have run, the ray user appears to be only a convention introduced by the upstream image rather than a real need, and the image works fine without it.
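
To make the arbitrary-user point concrete, here is a minimal sketch of the common OpenShift convention for images that must run under an arbitrary user ID: writable directories are owned by the root group (GID 0) and given the same permissions for the group as for the owner. The path /opt/workdir is hypothetical and used only for illustration; this is not taken from the Dockerfile in this PR.

```dockerfile
# Sketch only: /opt/workdir is a hypothetical path used for illustration.
FROM quay.io/rhoai/ray:2.23.0-py39-cu121

# Switch to root temporarily to adjust ownership and permissions.
USER 0

# Give the root group (GID 0) the same permissions as the owner, so that a
# process running with an arbitrary UID and GID 0 can still write here.
RUN mkdir -p /opt/workdir && \
    chgrp -R 0 /opt/workdir && \
    chmod -R g=u /opt/workdir

# Restore the non-root default user, as the base image does.
USER 1001
```

With this layout the image does not depend on a fixed named user such as ray, so the Pod can keep the default security constraints.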

@openshift-ci openshift-ci bot added lgtm and removed lgtm labels Jun 28, 2024
@astefanutti astefanutti requested review from sutaakar and KPostOffice and removed request for MichaelClifford and jbusche July 1, 2024 09:38
@Bobbins228 Bobbins228 left a comment

/lgtm

@sutaakar sutaakar left a comment

/lgtm

@openshift-ci openshift-ci bot removed the lgtm label Jul 1, 2024
@ChristianZaccaria ChristianZaccaria left a comment

/lgtm !!

@openshift-ci openshift-ci bot added the lgtm label Jul 1, 2024
openshift-ci bot commented Jul 1, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Ygnas
Once this PR has been reviewed and has the lgtm label, please ask for approval from astefanutti. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@astefanutti astefanutti merged commit 0d87700 into main Jul 2, 2024
1 check was pending
@astefanutti astefanutti deleted the ray-image branch July 2, 2024 07:21
6 participants