
Conversation

@sjain-stanford (Member) commented Aug 15, 2022

This is the first step towards dockerizing CI.

Fixes #1162

PR includes:

  • Dockerfiles and run scripts to repro the Ubuntu-x86-64 builds exactly locally
  • GHA workflow modified to use the same docker setup to run builds + tests in CI (making the Docker flow load-bearing for CI)
  • ccache mounts within docker container to accelerate builds
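
A rough sketch of what the ccache mount amounts to (the image name and paths here are illustrative assumptions, not the PR's exact values):

# Mount a host ccache directory so compiler caches survive container restarts
docker run -it \
  -v "$HOME/.ccache:/root/.ccache" \
  -e CCACHE_DIR=/root/.ccache \
  -v "$PWD:/src/torch-mlir" \
  torch-mlir-dev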

PR does not include (can be follow-on):

  • Unify with build_tools/python_deploy/build_linux_packages.sh
  • Dockerfile updates to ubuntu / python versions

Local repro steps:

# Build an image and launch an interactive docker container
./build_tools/docker/run_docker.sh

# Run cmake build (either in-tree or out-of-tree)
./build_tools/docker/run_cmake_build{_oot}.sh

# Run torch-mlir unit tests (+ python + dialect LIT tests)
./build_tools/docker/run_unit_tests.sh

# Run torch-mlir integration tests
./build_tools/docker/run_integration_tests.sh

This should provide much-needed relief when reproducing CI failures locally and avoid environment discrepancies.

@powderluv (Collaborator)

My suggestion from the previous discussion #1162 (comment) is to avoid a Dockerfile and a docker build step. All the deps can already be installed as part of the setup / run. This gives us flexibility to choose which Docker image we want (manylinux for Release, Ubuntu 22.04 for CIs).

@sjain-stanford (Member, Author) commented Aug 15, 2022

My suggestion from the previous discussion #1162 (comment) is to avoid a Dockerfile and a docker build step. All the deps can already be installed as part of the setup / run. This gives us flexibility to choose which Docker image we want (manylinux for Release, Ubuntu 22.04 for CIs).

Thanks @powderluv. I had a discussion with my team around these requirements (cc: @sanjoy, @asaadaldien), and we think a good definition of "what is the minimum we need to get this going" would help us.

As noted in the description, this PR doesn't touch on the "unification with release builds" work. That would require more work along these lines:

  • Both CI and release flows use the same environments (which is not the case today)
  • Unify the cmake build step to use setup.py etc (which is not the case today)

In our opinion, these are orthogonal feature requests and need not be tied to the scope of this PR (#1 of N). This PR simply preserves the existing CI and adds a docker wrapper to run things reproducibly, both locally and in CI, which should suffice to enable the docker reproducers we need, and it addresses @silvasean's request to make the docker flow load-bearing for CI.

to avoid a Dockerfile and a docker build step. All the deps can already be installed as part of the setup / run.

While eliminating the Dockerfile may sound okay for CI / release builds, it does impact local flows and their fast-turnaround requirements, because it forgoes Docker's layered caching (each local invocation would be a fresh install at runtime). Maybe I'm missing details; could you elaborate on why you want to avoid a Dockerfile in the first place?
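
To make the layering point concrete, here is a minimal sketch of the kind of Dockerfile this relies on (the base image and package list are illustrative assumptions, not the PR's actual file):

# Each instruction is a cached layer: rebuilding re-runs only the layers
# whose inputs changed, so dependency installs aren't repeated.
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y cmake ninja-build clang ccache python3-pip
COPY requirements.txt /tmp/requirements.txt
RUN python3 -m pip install -r /tmp/requirements.txt
# The source tree is mounted at runtime, so code edits don't invalidate
# the dependency layers above.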

@asaadaldien (Member)

@powderluv I think running docker in GHA isn't making things worse; in fact, it makes things better, because it makes it easy for downstream users & contributors to reproduce the same build artifacts as GHA without running GHA.

@asaadaldien (Member)

My suggestion from the previous discussion #1162 (comment) is to avoid a Dockerfile and a docker build step. All the deps can already be installed as part of the setup / run. This gives us flexibility to choose which Docker image we want (manylinux for Release, Ubuntu 22.04 for CIs).

@powderluv Can you please elaborate? I'm not sure I follow how we'd choose between multiple Docker images without adding Dockerfiles.

@powderluv (Collaborator)

My suggestion from the previous discussion #1162 (comment) is to avoid a Dockerfile and a docker build step. All the deps can already be installed as part of the setup / run. This gives us flexibility to choose which Docker image we want (manylinux for Release, Ubuntu 22.04 for CIs).

Thanks @powderluv. I had a discussion with my team around these requirements (cc: @sanjoy, @asaadaldien), and we think a good definition of "what is the minimum we need to get this going" would help us.

As noted in the description, this PR doesn't touch on the "unification with release builds" work. That would require more work along these lines:

  • Both CI and release flows use the same environments (which is not the case today)

Yeah release has to use manylinux.

  • Unify the cmake build step to use setup.py etc (which is not the case today)

This is not required. We just call the CI commands today in the script.

In our opinion, these are orthogonal feature requests and need not be tied to the scope of this PR (#1 of N). This PR simply preserves the existing CI and adds a docker wrapper to run things reproducibly, both locally and in CI, which should suffice to enable the docker reproducers we need, and it addresses @silvasean's request to make the docker flow load-bearing for CI.

to avoid a Dockerfile and a docker build step. All the deps can already be installed as part of the setup / run.

While eliminating the Dockerfile may sound okay for CI / release builds, it does impact local flows and their fast-turnaround requirements, because it forgoes Docker's layered caching (each local invocation would be a fresh install at runtime). Maybe I'm missing details; could you elaborate on why you want to avoid a Dockerfile in the first place?

I want to avoid us taking on a heavyweight docker release process (like IREE).

What are we building that we don't already have? Can we avoid the build stage and just use Docker images? Pip already caches your packages locally and in your Docker container when you are doing local runs.
Running docker build every time also fills up your build cache, which requires purging / managing.
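
For reference, that build cache can be inspected and cleared with stock Docker commands:

# Show disk usage of images, containers, and the build cache
docker system df
# Clear the builder cache when it grows
docker builder prune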

@powderluv (Collaborator)

@powderluv I think running docker in GHA isn't making things worse; in fact, it makes things better, because it makes it easy for downstream users & contributors to reproduce the same build artifacts as GHA without running GHA.

Oh, I hope I didn't say running docker makes it worse. I am saying it would help us all to avoid yet another new flow if we did #1162 (comment).

So a dev doesn't have to do:

  • Their normal build
  • Their CI docker flow
  • Their release docker flow

The last two can be one entry point.

@powderluv (Collaborator) commented Aug 16, 2022

My suggestion from the previous discussion #1162 (comment) is to avoid a Dockerfile and a docker build step. All the deps can already be installed as part of the setup / run. This gives us flexibility to choose which Docker image we want (manylinux for Release, Ubuntu 22.04 for CIs).

@powderluv Can you please elaborate? I'm not sure I follow how we'd choose between multiple Docker images without adding Dockerfiles.

We already have a way to do this for release builds, so it would look something like this (whatever we want the env vars to be):

TM_PYTHON_VERSIONS=cp310-cp310 MANYLINUX_DOCKER_IMAGE=nvidia/cuda:10.2-cudnn8-devel-ubuntu18.04 TORCH_BINARY=OFF PACKAGES="torch-mlir OOT-CI" ./build_tools/python_deploy/build_linux_packages.sh
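
(Reading the example: the docker image would come from MANYLINUX_DOCKER_IMAGE, the Python ABI tags from TM_PYTHON_VERSIONS, and source-vs-binary PyTorch from TORCH_BINARY, so switching images is just an env-var change with no Dockerfile in the repo.)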

@powderluv (Collaborator)

Probably easier to just explain it in code: #1234. It provides a single entry point for end users to replicate the CI / Release builds.

It allows the same workflow to run any docker image you want -- building all CI and Release builds in one go. You can also pick the Python versions, and whether you want PyTorch in source or binary form.

It only needs the actual CMake commands filled in.

@sjain-stanford I didn't want to hijack your PR, so please take over any / all parts of #1234.

@asaadaldien (Member)

TM_PYTHON_VERSIONS=cp310-cp310 MANYLINUX_DOCKER_IMAGE=nvidia/cuda:10.2-cudnn8-devel-ubuntu18.04 TORCH_BINARY=OFF PACKAGES="torch-mlir OOT-CI" ./build_tools/python_deploy/build_linux_packages.sh

Thanks for explaining, @powderluv. To test my understanding: we want CI to run build_linux_packages.sh from #1234 to unify the CI & release flows, and to parameterize it such that the CI buildAndTest.yaml pulls a specific docker image while downstream users are free to use whatever docker image they want. Is that correct?

@sjain-stanford (Member, Author) commented Aug 16, 2022

So a dev doesn't have to do:

  • Their normal build
  • Their CI docker flow
  • Their release docker flow

Correct me if I'm wrong, but we wouldn't need to keep (1) and (2) split anymore, as the docker flow would eliminate the need for (1). CI is transparent to developers, and they only need to care about:

  1. Developer build (mirrored in CI)
  2. Release flow

I am hesitant to combine the developer flow with the release build flow because they address different requirements:
One is fast and heavily cached, for quick signals from runs repeated many times an hour; the other is slow, doesn't / shouldn't use a cache, and is used for once-a-day sort of releases.

@sjain-stanford (Member, Author) commented Aug 16, 2022

Don't add yet another dev workflow. We have the native CMake flows (OOT, in-tree), setup.py, and then the Release build docker flow. I would suggest we use the Release docker flow instead of yet another flow.

The docker flow supports both native CMake flows (OOT, in-tree). It just runs them in a container rather than natively (which is more robust and guarantees a clean CI run if the local run works, so devs don't need to run CI separately). So with this PR we'd be reducing the number of dev workflows from 3 to 2:

  • Native CMake flow (both OOT and in-tree supported) -> what devs test locally is guaranteed to be mirrored in CI
  • Release build flow (uses setup.py for CMake) -> this specifically tests the python wheel interface

@powderluv (Collaborator)

So a dev doesn't have to do:

  • Their normal build
  • Their CI docker flow
  • Their release docker flow

Correct me if I'm wrong, but we wouldn't need to keep (1) and (2) split anymore, as the docker flow would eliminate the need for (1). CI is transparent to developers, and they only need to care about:

  1. Developer build (mirrored in CI)

  2. Release flow

I am hesitant to combine the developer flow with the release build flow because they address different requirements:

One is fast and heavily cached, for quick signals from runs repeated many times an hour; the other is slow, doesn't / shouldn't use a cache, and is used for once-a-day sort of releases.

Having learned from the libtorch changes, I won't change current developer behavior.

So it becomes:

  • Lightweight developer flow (cmake, bazel, OOT, etc.)
  • Local CI / Release re-creation to check PRs before submitting.
  • CI and Release on GHA

The second and third use the same scripts. We can try to bridge the gap between 1 and 2 if 2 is slower (with caching etc.).

@sjain-stanford (Member, Author) commented Aug 16, 2022

Having learned from the libtorch changes, I won't change current developer behavior.

So it becomes:

  • Lightweight developer flow (cmake, bazel, OOT, etc.)
  • Local CI / Release re-creation to check PRs before submitting.
  • CI and Release on GHA

I'm surely missing some details here. Could you please elaborate on what aspect of libtorch requires us to keep a "separate" lightweight developer flow that doesn't run inside docker?

Is it the pytorch binary support? Would it help if we supported installing pytorch from source inside the docker container?

@sjain-stanford (Member, Author) commented Aug 16, 2022

We already have a way to do this for release builds, so it would look something like this (whatever we want the env vars to be):
TM_PYTHON_VERSIONS=cp310-cp310 MANYLINUX_DOCKER_IMAGE=nvidia/cuda:10.2-cudnn8-devel-ubuntu18.04 TORCH_BINARY=OFF PACKAGES="torch-mlir OOT-CI" ./build_tools/python_deploy/build_linux_packages.sh

The way build_linux_packages.sh is set up, it just spawns a container to run a build, generate artifacts, and exit. This is perfect for releases (no human involvement). What we're looking to support with the docker flow is a local reproducer, which lets users stay inside the container (interactively) and debug any failures.

Here the first command launches a container interactively, from which the next three commands are run. So users get the chance to make incremental fixes and re-test without launching another container.

# Build an image and launch an interactive docker container
./build_tools/docker/run_docker.sh

# Run cmake build (either in-tree or out-of-tree)
./build_tools/docker/run_cmake_build{_oot}.sh

# Run torch-mlir unit tests (+ python + dialect LIT tests)
./build_tools/docker/run_unit_tests.sh

# Run torch-mlir integration tests
./build_tools/docker/run_integration_tests.sh
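
Roughly, the distinction is (a sketch with placeholder image names):

# Release-style one-shot: the container runs the build, emits artifacts, exits
docker run --rm <manylinux-image> ./build_script.sh

# This PR's reproducer: an interactive shell that persists across the
# build/test commands above, so failures can be debugged in place
docker run -it <dev-image> /bin/bash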

@sjain-stanford (Member, Author) commented Aug 16, 2022

I think I better understand why you'd want to keep the native developer workflow alive in conjunction with the docker workflow - e.g. for macOS (which doesn't natively support docker in GHA).

However, I'm missing how this adds "one other workflow users need to worry about".

Currently, any CMake change users make requires changing (1) the local CMake configuration for native builds, (2) the GHA CMake config for CI, and (3) the release script's setup.py-based CMake config.

With this change, we still have the same number of flows, right? The only thing this PR changes is (2): instead of blindly updating the raw GHA workflow (which is difficult to validate locally and requires pushing), users can now update run_cmake_build.sh and validate it locally before pushing. This saves scarce CI resources and encourages users to validate their patches locally before sending PRs upstream. I see this as a win-win and a net improvement over what we have currently, with no regressions / tradeoffs.
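
Concretely, the CI job then reduces to running the same scripts a developer runs locally, along these lines (a hypothetical shape of the workflow step, using the scripts from this PR):

./build_tools/docker/run_docker.sh
./build_tools/docker/run_cmake_build.sh
./build_tools/docker/run_unit_tests.sh
./build_tools/docker/run_integration_tests.sh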

Regarding unifying this docker workflow with the release docker workflow: that seems like a good incremental next step, but it isn't required for this PR to be considered.

@sjain-stanford (Member, Author) commented Aug 16, 2022

I'm also happy to close this one if you'd like to continue bringing #1234 to completion, @powderluv (including the CMake config, ccache, pip cache, and running unit and integration tests). Anything that helps us validate the CMake CI locally seems like a net improvement, and we'd benefit from it; it doesn't have to be done this way. We (Cruise) are mainly concerned with the bazel builds, so I can send out a separate PR to dockerize just the bazel GHA build and extend it to include unit and integration tests, which should cover all bases for us.

@powderluv (Collaborator)

OK, I'll take over and complete #1234. Will get to it later this week.

@sjain-stanford (Member, Author)

Thank you @powderluv. This is very much appreciated. Looking forward to it.
