UBI9 Ray container image #177
Conversation
'ray[all]==2.23.0'

# Restore user workspace
USER 1001
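For context, a hedged reconstruction of what the surrounding Dockerfile likely looks like; the USER 0 step and the exact pip invocation are assumptions, not taken from the diff:

# Sketch only: install Ray with elevated privileges, then drop back
# to the non-root workspace user.
USER 0
RUN pip install --no-cache-dir 'ray[all]==2.23.0'
# Restore user workspace
USER 1001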
Still wondering whether we should stick to the ray user that is set for the upstream images?
Thinking about whether keeping the ray user makes any difference from the user's point of view. I vaguely remember someone having an issue accessing some part of the Ray dashboard (I know, the description is not very specific), though I cannot reproduce it with quay.io/rhoai/ray:2.23.0-py39-cu121.
@sutaakar I might be wrong, but are you referring to this issue? https://issues.redhat.com/browse/RHOAIENG-5493
RHOAIENG-5493 is caused by security constraints rather than by the user. Actually, keeping the ray user forces the security constraints to be lowered, as the Pod then cannot run with an arbitrary user ID. From the tests we have run, it seems that the ray user is only a convention introduced by the upstream image rather than a real need, and that everything works fine without it.
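A minimal sketch of the OpenShift-friendly pattern being described here, assuming the workspace lives under /home/ray; group-zero ownership is what lets the container run under an arbitrary UID instead of a fixed ray user:

# Sketch: make the workspace writable by the root group (GID 0), so
# OpenShift can start the container with an arbitrary, non-root UID.
RUN chgrp -R 0 /home/ray && \
    chmod -R g=u /home/ray
# A numeric, non-root default user; OpenShift may still override the UID.
USER 1001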
/lgtm
/lgtm
/lgtm !!
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: Ygnas. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Description
Provide a base Ray container image for Distributed Workloads that includes the following layers / components:
Document how to build a custom container image on top of that base Ray container image.
See https://issues.redhat.com/browse/RHOAIENG-7846 and https://issues.redhat.com/browse/RHOAIENG-7875.
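A hedged example of the kind of custom image the description calls for; the base image tag matches the one mentioned in this thread, and the PyTorch pin matches the test below, but the exact pip invocation is an assumption:

FROM quay.io/rhoai/ray:2.23.0-py39-cu121
# Layer the extra ML stack on top of the base Ray image.
RUN pip install --no-cache-dir torch==2.3.1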
How Has This Been Tested?
This image has been successfully tested with the ray-finetune-llm-deepspeed example, by building a custom image on top of it containing PyTorch 2.3.1 and fine-tuning a Llama 7B model over several epochs (~10h run).
The Ray dashboard works as expected, and GPU utilisation is nominal.