UBI9 Ray container image #177

astefanutti · 2024-06-26T09:48:45Z

Description

Provide a base Ray container image for Distributed Workloads that includes the following layers and components:

UBI 9
Python 3.9
CUDA 12.1
Ray 2.23.0

Document how to build a custom container image on top of that base Ray container image.
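
As a rough illustration of what building on top of the base image might look like, here is a minimal sketch. It is not the Dockerfile or documentation added by this PR. The image tag is the one quoted in the review discussion, the PyTorch version is the one mentioned in the testing notes, and the sketch assumes pip can install packages as the image's default user.

```dockerfile
# Sketch only, not the Dockerfile added by this PR.
# Assumption: the base image is published under this tag.
FROM quay.io/rhoai/ray:2.23.0-py39-cu121

# Assumption: pip can install into the image's Python 3.9 environment
# as the default (non-root) user.
RUN python3 -m pip install --no-cache-dir 'torch==2.3.1'
```

Such a file could be built with a standard `podman build` or `docker build` invocation and the resulting image referenced from the Ray cluster definition.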

See https://issues.redhat.com/browse/RHOAIENG-7846 and https://issues.redhat.com/browse/RHOAIENG-7875.

How Has This Been Tested?

This image has been successfully tested with the ray-finetune-llm-deepspeed example by building a custom image on top of it that contains PyTorch 2.3.1, and fine-tuning a Llama 7B model over several epochs (~10h run).

The Ray dashboard works as expected, and GPU utilisation is nominal.

astefanutti · 2024-06-27T14:23:50Z

images/runtime/ray/Dockerfile

+    'ray[all]==2.23.0'
+
+# Restore user workspace
+USER 1001


I still wonder whether we should stick to the ray user that is set in the upstream images.

I'm wondering whether keeping the ray user makes any difference from the user's point of view.
I vaguely remember that someone had an issue accessing some part of the Ray dashboard (I know, the description is not very specific), though I cannot reproduce it with quay.io/rhoai/ray:2.23.0-py39-cu121.

@sutaakar I might be wrong, but are you referring to this issue? https://issues.redhat.com/browse/RHOAIENG-5493

RHOAIENG-5493 is caused by security constraints rather than by the user.

Actually, the ray user forces us to lower the security constraints, as the Pod then cannot run with an arbitrary user. From the tests we have run, it seems that the ray user is only a convention introduced by the upstream image rather than a real need, and that the image works fine without it.
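
For context, images intended to run under an arbitrary user ID on OpenShift commonly make their writable directories owned by the root group instead of depending on a named user. The sketch below shows that general pattern only. The base image and the /opt/app directory are placeholders chosen for illustration and are not taken from this PR's Dockerfile.

```dockerfile
# Sketch only: general OpenShift pattern, not this PR's Dockerfile.
FROM registry.access.redhat.com/ubi9/python-39

# Hypothetical application directory, made writable by the root group
# so that an arbitrary UID (which belongs to group 0) can use it.
USER 0
RUN mkdir -p /opt/app && \
    chgrp -R 0 /opt/app && \
    chmod -R g=u /opt/app

# Numeric, non-root user; OpenShift may substitute an arbitrary UID at runtime.
USER 1001
```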

images/runtime/ray/Dockerfile

images/runtime/ray/NGC-DL-CONTAINER-LICENSE

images/runtime/README.md

Bobbins228

/lgtm

sutaakar

/lgtm

images/runtime/examples/README.md

ChristianZaccaria

/lgtm !!

openshift-ci · 2024-07-01T15:56:52Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Ygnas
Once this PR has been reviewed and has the lgtm label, please ask for approval from astefanutti. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the do-not-merge/work-in-progress label Jun 26, 2024

astefanutti force-pushed the ray-image branch from f259d07 to f537609 June 26, 2024 16:31

astefanutti marked this pull request as ready for review June 26, 2024 16:39

openshift-ci bot removed the do-not-merge/work-in-progress label Jun 26, 2024

openshift-ci bot requested review from jbusche and MichaelClifford June 26, 2024 16:39

astefanutti commented Jun 27, 2024

Ygnas reviewed Jun 27, 2024

images/runtime/ray/Dockerfile (outdated, resolved)

KPostOffice reviewed Jun 27, 2024

images/runtime/ray/NGC-DL-CONTAINER-LICENSE (outdated, resolved)

KPostOffice reviewed Jun 27, 2024

images/runtime/README.md (outdated, resolved)

astefanutti force-pushed the ray-image branch from f537609 to cd84ba0 June 28, 2024 07:28

astefanutti mentioned this pull request Jun 28, 2024

Default to Ray container image provided by OpenShift AI project-codeflare/codeflare-sdk#576

Merged

4 tasks

astefanutti force-pushed the ray-image branch from cd84ba0 to ed89415 June 28, 2024 07:58

Ygnas approved these changes Jun 28, 2024

openshift-ci bot assigned Ygnas Jun 28, 2024

openshift-ci bot added lgtm and removed lgtm labels Jun 28, 2024

astefanutti requested review from sutaakar and KPostOffice and removed request for MichaelClifford and jbusche July 1, 2024 09:38

Bobbins228 reviewed Jul 1, 2024

openshift-ci bot assigned Bobbins228 Jul 1, 2024

openshift-ci bot added the lgtm label Jul 1, 2024

sutaakar reviewed Jul 1, 2024

openshift-ci bot assigned sutaakar Jul 1, 2024

Ygnas requested changes Jul 1, 2024

images/runtime/examples/README.md (outdated, resolved)

images/runtime/examples/README.md (outdated, resolved)

openshift-ci bot removed the lgtm label Jul 1, 2024

ChristianZaccaria reviewed Jul 1, 2024

openshift-ci bot assigned ChristianZaccaria Jul 1, 2024

openshift-ci bot added the lgtm label Jul 1, 2024

astefanutti force-pushed the ray-image branch from 8e3266b to 42cce5d July 1, 2024 15:53

openshift-ci bot removed the lgtm label Jul 1, 2024

astefanutti added 9 commits July 1, 2024 17:55

UBI9 Ray container image

61bbbda

Custom Ray image with Torch example

a4f76b6

Work-around Dao-AILab/flash-attention#453 when installing Flash Attention

da054a1

Upgrade DeepSpeed to 0.14.4 for NumPy 2 support

f8f37c1

Use custom Torch image based on UBI9 Ray image

1d2a8c2

Add note for TensorBoard compatibility with NumPy 2

c29bdf6

Downgrade Ray container image to Python 3.9

6781d24

Stick to NumPy < 2.0.0 for user-space compatibility

ea3dc9f

doc: Add instructions to build custom container images

33b46e6

astefanutti force-pushed the ray-image branch from 42cce5d to 33b46e6 July 1, 2024 15:55

Ygnas approved these changes Jul 1, 2024

openshift-ci bot added the lgtm label Jul 1, 2024

astefanutti merged commit 0d87700 into main Jul 2, 2024
1 check was pending

astefanutti deleted the ray-image branch July 2, 2024 07:21
