Updates in GCP set up #846

Merged
merged 1 commit into from
Feb 27, 2025
32 changes: 16 additions & 16 deletions .github/workflows/regression_tests.yml

Large diffs are not rendered by default.

85 changes: 0 additions & 85 deletions .github/workflows/regression_tests_variants.yml

This file was deleted.

9 changes: 6 additions & 3 deletions docker/build_docker_images.sh
@@ -13,6 +13,9 @@ do
esac
done

+# Artifact repository
+ARTIFACT_REPO="europe-docker.pkg.dev/mlcommons-algoperf/algoperf-docker-repo"

if [[ -z ${GIT_BRANCH+x} ]]
then
GIT_BRANCH='main' # Set default argument
@@ -22,9 +25,9 @@ for FRAMEWORK in "jax" "pytorch" "both"
do
IMAGE_NAME="algoperf_${FRAMEWORK}_${GIT_BRANCH}"
DOCKER_BUILD_COMMAND="docker build --no-cache -t $IMAGE_NAME . --build-arg framework=$FRAMEWORK --build-arg branch=$GIT_BRANCH"
-DOCKER_TAG_COMMAND="docker tag $IMAGE_NAME us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/$IMAGE_NAME"
-DOCKER_PUSH_COMMAND="docker push us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/$IMAGE_NAME"
-DOCKER_PULL_COMMAND="docker pull us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/$IMAGE_NAME"
+DOCKER_TAG_COMMAND="docker tag $IMAGE_NAME $ARTIFACT_REPO/$IMAGE_NAME"
+DOCKER_PUSH_COMMAND="docker push $ARTIFACT_REPO/$IMAGE_NAME"
+DOCKER_PULL_COMMAND="docker pull $ARTIFACT_REPO/$IMAGE_NAME"

echo "On branch: ${GIT_BRANCH}"
echo $DOCKER_BUILD_COMMAND
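
For reference, a minimal sketch of how the tag and push targets now derive from the single `ARTIFACT_REPO` variable. The `image_ref` helper is hypothetical and not part of `build_docker_images.sh`; the repository value is copied verbatim from the hunk above (it uses the `europe-docker.pkg.dev` host, whereas the other files in this PR use `europe-west4-docker.pkg.dev`). The commands are only printed, not executed.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in build_docker_images.sh): build the full
# registry reference for one framework/branch pair.
ARTIFACT_REPO="europe-docker.pkg.dev/mlcommons-algoperf/algoperf-docker-repo"

image_ref() {
  local framework="$1"
  local branch="${2:-main}"  # same default the script uses for GIT_BRANCH
  echo "${ARTIFACT_REPO}/algoperf_${framework}_${branch}"
}

# Dry run: print the tag and push commands for each framework.
for framework in jax pytorch both; do
  echo "docker tag algoperf_${framework}_main $(image_ref "$framework")"
  echo "docker push $(image_ref "$framework")"
done
```

Keeping the registry prefix in one place means a future registry move touches a single line rather than three command strings.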
4 changes: 2 additions & 2 deletions docker/scripts/cloud-init.cfg
@@ -40,8 +40,8 @@ write_files:
ExecStartPre=mount --bind /var/lib/nvidia /var/lib/nvidia
ExecStartPre=mount -o remountexec /var/lib/nvidia
ExecStartPre=/usr/bin/docker-credential-gcr configure-docker --registries us-central1-docker.pkg.dev
-ExecStartPre=/usr/bin/docker pull us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/base_image:latest
-ExecStart=/usr/bin/docker run --rm --name=mlcommons --volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64 --volume /var/lib/nvidia/bin:/usr/local/nvidia/bin --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia1:/dev/nvidia1 --device /dev/nvidia2:/dev/nvidia2 --device /dev/nvidia3:/dev/nvidia3 --device /dev/nvidia4:/dev/nvidia4 --device /dev/nvidia5:/dev/nvidia5 --device /dev/nvidia6:/dev/nvidia6 --device /dev/nvidia7:/dev/nvidia7 --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidiactl:/dev/nvidiactl us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/base_image:latest -b true
+ExecStartPre=/usr/bin/docker pull europe-west4-docker.pkg.dev/mlcommons-algoperf/algoperf-docker-repo/base_image:latest
+ExecStart=/usr/bin/docker run --rm --name=mlcommons --volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64 --volume /var/lib/nvidia/bin:/usr/local/nvidia/bin --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia1:/dev/nvidia1 --device /dev/nvidia2:/dev/nvidia2 --device /dev/nvidia3:/dev/nvidia3 --device /dev/nvidia4:/dev/nvidia4 --device /dev/nvidia5:/dev/nvidia5 --device /dev/nvidia6:/dev/nvidia6 --device /dev/nvidia7:/dev/nvidia7 --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidiactl:/dev/nvidiactl europe-west4-docker.pkg.dev/mlcommons-algoperf/algoperf-docker-repo/base_image:latest -b true
StandardOutput=journal+console
StandardError=journal+console
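
The `configure-docker --registries` context line names a registry host, while the pull and run lines name a full image path. A minimal sketch of deriving the former from the latter, which may help keep the two in sync when the registry moves. `registry_host` is a hypothetical helper and is not part of `cloud-init.cfg`; the command is only printed, not executed.

```bash
#!/usr/bin/env bash
# Hypothetical helper: return the registry host, i.e. everything before
# the first "/" of a fully qualified image reference.
registry_host() {
  echo "${1%%/*}"
}

IMAGE="europe-west4-docker.pkg.dev/mlcommons-algoperf/algoperf-docker-repo/base_image:latest"

# Dry run: print the credential-helper command this host would imply.
echo "docker-credential-gcr configure-docker --registries $(registry_host "$IMAGE")"
```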

18 changes: 14 additions & 4 deletions docker/scripts/startup.sh
@@ -51,6 +51,8 @@ RSYNC_DATA="true"
OVERWRITE="false"
SAVE_CHECKPOINTS="true"
TUNING_RULESET="external"
+ROOT_DATA_BUCKET="algoperf-data"
+LOGS_BUCKET="algoperf-runs"

# Pass flag
while [ "$1" != "" ]; do
@@ -136,6 +138,14 @@ while [ "$1" != "" ]; do
shift
ADDITIONAL_REQUIREMENTS_PATH=$1
;;
+--data_bucket)
+shift
+ROOT_DATA_BUCKET=$1
+;;
+--logs_bucket)
+shift
+LOGS_BUCKET=$1
+;;
*)
usage
exit 1
@@ -179,11 +189,11 @@ VALID_WORKLOADS=("criteo1tb" "imagenet_resnet" "imagenet_resnet_silu" "imagenet_
VALID_RULESETS=("self" "external")

# Set data and experiment paths
-ROOT_DATA_BUCKET="gs://mlcommons-data"
ROOT_DATA_DIR="${HOME_DIR}/data"
+ROOT_DATA_BUCKET="gs://${ROOT_DATA_BUCKET}"

-EXPERIMENT_BUCKET="gs://mlcommons-runs"
EXPERIMENT_DIR="${HOME_DIR}/experiment_runs"
+EXPERIMENT_LOGS_BUCKET="gs://${LOGS_BUCKET}"

if [[ -n ${DATASET+x} ]]; then
if [[ ! " ${VALID_DATASETS[@]} " =~ " $DATASET " ]]; then
@@ -313,8 +323,8 @@ if [[ ! -z ${SUBMISSION_PATH+x} ]]; then
RETURN_CODE=$?

if [[ $INTERNAL_CONTRIBUTOR_MODE == "true" ]]; then
-/google-cloud-sdk/bin/gsutil -m cp -r ${EXPERIMENT_DIR}/${EXPERIMENT_NAME}/${WORKLOAD}_${FRAMEWORK} ${EXPERIMENT_BUCKET}/${EXPERIMENT_NAME}/
-/google-cloud-sdk/bin/gsutil -m cp ${LOG_FILE} ${EXPERIMENT_BUCKET}/${EXPERIMENT_NAME}/${WORKLOAD}_${FRAMEWORK}/
+/google-cloud-sdk/bin/gsutil -m cp -r ${EXPERIMENT_DIR}/${EXPERIMENT_NAME}/${WORKLOAD}_${FRAMEWORK} ${EXPERIMENT_LOGS_BUCKET}/${EXPERIMENT_NAME}/
+/google-cloud-sdk/bin/gsutil -m cp ${LOG_FILE} ${EXPERIMENT_LOGS_BUCKET}/${EXPERIMENT_NAME}/${WORKLOAD}_${FRAMEWORK}/
fi

fi
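
The new `--data_bucket` and `--logs_bucket` flags take a bare bucket name, and the script prepends `gs://` afterwards. A minimal standalone sketch of that flow follows. The defaults and flag names are taken from the hunks above; `parse_bucket_flags` and `to_gs_uri` are hypothetical helpers, and stripping an already-present `gs://` prefix is an added assumption, not something `startup.sh` does.

```bash
#!/usr/bin/env bash
# Sketch only: defaults and flag names follow the diff; the helper
# functions are hypothetical and not part of startup.sh.
ROOT_DATA_BUCKET="algoperf-data"
LOGS_BUCKET="algoperf-runs"

parse_bucket_flags() {
  while [ "$1" != "" ]; do
    case "$1" in
      --data_bucket) shift; ROOT_DATA_BUCKET="$1" ;;
      --logs_bucket) shift; LOGS_BUCKET="$1" ;;
      *) echo "unknown flag: $1" >&2; return 1 ;;
    esac
    shift
  done
}

# Assumption: prepend gs:// only when the caller did not already include
# it, so "my-bucket" and "gs://my-bucket" both yield "gs://my-bucket".
to_gs_uri() {
  echo "gs://${1#gs://}"
}

parse_bucket_flags --data_bucket my-data
to_gs_uri "$ROOT_DATA_BUCKET"   # prints gs://my-data
to_gs_uri "$LOGS_BUCKET"        # prints gs://algoperf-runs
```

With the script as merged, passing a value that already starts with `gs://` would produce a doubled prefix, so callers should pass the bare bucket name.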
4 changes: 2 additions & 2 deletions docs/CONTRIBUTING.md
@@ -88,7 +88,7 @@ gcloud auth configure-docker $ARTIFACT_REGISTRY_URL
To pull the latest prebuilt image:

```bash
-docker pull us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/<image_name>
+docker pull europe-west4-docker.pkg.dev/mlcommons-algoperf/algoperf-docker-repo/<image_name>
```

The naming convention for `image_name` is `algoperf_<framework>_<branch>`.
@@ -102,7 +102,7 @@ Currently maintained images on the repository are:
- `algoperf_both_dev`

To reference the pulled image you will have to use the full `image_path`, e.g.
-`us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_jax_main`.
+`europe-west4-docker.pkg.dev/mlcommons-algoperf/algoperf-docker-repo/algoperf_jax_main`.
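
As an illustration of the `algoperf_<framework>_<branch>` naming convention, the sketch below splits an image name back into its parts. `parse_image_name` is a hypothetical helper, not something the repository provides, and the accepted framework list (`jax`, `pytorch`, `both`) is taken from `build_docker_images.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: split algoperf_<framework>_<branch> into
# "<framework> <branch>", or fail if the name does not fit the convention.
parse_image_name() {
  local name="$1" rest framework branch
  rest="${name#algoperf_}"   # strip the fixed prefix
  framework="${rest%%_*}"    # text before the next underscore
  branch="${rest#*_}"        # everything after that underscore
  case "$framework" in
    jax|pytorch|both) ;;
    *) echo "not a maintained image name: $name" >&2; return 1 ;;
  esac
  # Reject names with no prefix, no branch separator, or an empty branch.
  if [ "$rest" = "$name" ] || [ "$rest" = "$framework" ] || [ -z "$branch" ]; then
    echo "not a maintained image name: $name" >&2
    return 1
  fi
  echo "$framework $branch"
}

parse_image_name "algoperf_jax_main"   # prints: jax main
```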

### Trigger Rebuild and Push of Maintained Images

2 changes: 1 addition & 1 deletion scoring/run_workloads.py
@@ -26,7 +26,7 @@

flags.DEFINE_string(
'docker_image_url',
-'us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_jax_dev',
+'europe-west4-docker.pkg.dev/mlcommons-algoperf/algoperf-docker-repo/algoperf_jax_dev',
'URL to docker image')
flags.DEFINE_integer(
'run_percentage',