
Conversation

@sjain-stanford (Member) commented Aug 15, 2022

This is the first step towards dockerizing CI.

Fixes #1162

PR includes:

  • Dockerfiles and run scripts to repro the Ubuntu-x86-64 builds exactly locally
  • GHA workflow modified to use the same docker setup to run builds + tests in CI (making the Docker flow load-bearing for CI)
  • ccache mounts within docker container to accelerate builds
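
A rough sketch of what the ccache mount amounts to (the image name and paths here are illustrative assumptions, not the PR's exact values):

# Mount a host ccache directory so compiler caches survive container restarts
docker run -it \
  -v "$HOME/.ccache:/root/.ccache" \
  -e CCACHE_DIR=/root/.ccache \
  -v "$PWD:/src/torch-mlir" \
  torch-mlir-dev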

PR does not include (can be follow-on):

  • Unify with build_tools/python_deploy/build_linux_packages.sh
  • Dockerfile updates to ubuntu / python versions

Local repro steps:

# Build an image and launch an interactive docker container
./build_tools/docker/run_docker.sh

# Run cmake build (either in-tree or out-of-tree)
./build_tools/docker/run_cmake_build{_oot}.sh

# Run torch-mlir unit tests (+ python + dialect LIT tests)
./build_tools/docker/run_unit_tests.sh

# Run torch-mlir integration tests
./build_tools/docker/run_integration_tests.sh

This should provide much-needed relief when reproducing CI failures locally and avoid environment discrepancies.

@powderluv (Collaborator)

My suggestion from the previous discussion #1162 (comment) is to avoid a Dockerfile and a docker build step. All the deps can already be installed as part of the setup / run. This gives us flexibility to choose which Docker image we want (manylinux for Release, Ubuntu 22.04 for CIs).

@sjain-stanford (Member, Author) commented Aug 15, 2022

My suggestion from the previous discussion #1162 (comment) is to avoid a Dockerfile and a docker build step. All the deps can already be installed as part of the setup / run. This gives us flexibility to choose which Docker image we want (manylinux for Release, Ubuntu 22.04 for CIs).

Thanks @powderluv. I had a discussion with my team around these requirements (cc: @sanjoy, @asaadaldien), and we think a good definition of "what is the minimum we need to get this going" would help us.

As noted in the description, this PR doesn't touch on the "unification with release builds" work. That would require more work along these lines:

  • Both CI and release flows use the same environments (which is not the case today)
  • Unify the cmake build step to use setup.py etc (which is not the case today)

In our opinion, these are orthogonal feature requests and need not be tied to the scope of this PR (#1 of N). This PR simply preserves the existing CI and adds a docker wrapper to run things reproducibly, both locally and in CI, which should suffice to enable the docker reproducers we need, and it addresses @silvasean's request to make the docker flow load-bearing for CI.

to avoid a Dockerfile and a docker build step. All the deps can already be installed as part of the setup / run.

While eliminating the Dockerfile may sound okay for CI / release builds, it does impact local flows and their fast-turnaround requirements, because it forgoes Docker's layered caching (each local invocation would be a fresh install at runtime). Maybe I'm missing details; could you elaborate on why you want to avoid a Dockerfile in the first place?
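
To make the layering point concrete, here is a minimal sketch of the kind of Dockerfile this relies on (the base image and package list are illustrative assumptions, not the PR's actual file):

# Each instruction is a cached layer: rebuilding re-runs only the layers
# whose inputs changed, so dependency installs aren't repeated.
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y cmake ninja-build clang ccache python3-pip
COPY requirements.txt /tmp/requirements.txt
RUN python3 -m pip install -r /tmp/requirements.txt
# The source tree is mounted at runtime, so code edits don't invalidate
# the dependency layers above.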

@asaadaldien (Member)

@powderluv I think running docker in GHA isn't making things worse; in fact, it makes things better, because it makes it easy for downstream users & contributors to reproduce the same build artifacts as GHA without running GHA.

@asaadaldien (Member)

My suggestion from the previous discussion #1162 (comment) is to avoid a Dockerfile and a docker build step. All the deps can already be installed as part of the setup / run. This gives us flexibility to choose which Docker image we want (manylinux for Release, Ubuntu 22.04 for CIs).

@powderluv Can you please elaborate? I'm not sure I follow how we'd choose between multiple Docker images without adding Dockerfiles.

@powderluv (Collaborator)

My suggestion from the previous discussion #1162 (comment) is to avoid a Dockerfile and a docker build step. All the deps can already be installed as part of the setup / run. This gives us flexibility to choose which Docker image we want (manylinux for Release, Ubuntu 22.04 for CIs).

Thanks @powderluv. I had a discussion with my team around these requirements (cc: @sanjoy, @asaadaldien), and we think a good definition of "what is the minimum we need to get this going" would help us.

As noted in the description, this PR doesn't touch on the "unification with release builds" work. That would require more work along these lines:

  • Both CI and release flows use the same environments (which is not the case today)

Yeah release has to use manylinux.

  • Unify the cmake build step to use setup.py etc (which is not the case today)

This is not required. We just call the CI commands today in the script.

In our opinion, these are orthogonal feature requests and need not be tied to the scope of this PR (#1 of N). This PR simply preserves the existing CI and adds a docker wrapper to run things reproducibly, both locally and in CI, which should suffice to enable the docker reproducers we need, and it addresses @silvasean's request to make the docker flow load-bearing for CI.

to avoid a Dockerfile and a docker build step. All the deps can already be installed as part of the setup / run.

While eliminating the Dockerfile may sound okay for CI / release builds, it does impact local flows and their fast-turnaround requirements, because it forgoes Docker's layered caching (each local invocation would be a fresh install at runtime). Maybe I'm missing details; could you elaborate on why you want to avoid a Dockerfile in the first place?

I want to avoid us taking on a heavyweight docker release process (like IREE).

What are we building that we don't already have? Can we avoid the build stage and just use Docker images? Pip already caches your packages locally and in your Docker container when you are doing local runs.
Running docker build every time also fills up your build cache, which requires purging / managing.
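
For reference, that build cache can be inspected and cleared with stock Docker commands:

# Show disk usage of images, containers, and the build cache
docker system df
# Clear the builder cache when it grows
docker builder prune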

@powderluv (Collaborator)

@powderluv I think running docker in GHA isn't making things worse; in fact, it makes things better, because it makes it easy for downstream users & contributors to reproduce the same build artifacts as GHA without running GHA.

Oh, I hope I didn't say running docker makes it worse. I am saying it would help us all to avoid yet another new flow if we did #1162 (comment).

So a dev doesn't have to do:

  • Their normal build
  • Their CI docker flow
  • Their release docker flow

The last two can be one entry point.

@powderluv (Collaborator) commented Aug 16, 2022

My suggestion from the previous discussion #1162 (comment) is to avoid a Dockerfile and a docker build step. All the deps can already be installed as part of the setup / run. This gives us flexibility to choose which Docker image we want (manylinux for Release, Ubuntu 22.04 for CIs).

@powderluv Can you please elaborate? I'm not sure I follow how we'd choose between multiple Docker images without adding Dockerfiles.

We already have a way to do this for release builds, so it would look something like this (whatever we want the env vars to be):

TM_PYTHON_VERSIONS=cp310-cp310 MANYLINUX_DOCKER_IMAGE=nvidia/cuda:10.2-cudnn8-devel-ubuntu18.04 TORCH_BINARY=OFF PACKAGES="torch-mlir OOT-CI" ./build_tools/python_deploy/build_linux_packages.sh
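
(Reading the example: the docker image would come from MANYLINUX_DOCKER_IMAGE, the Python ABI tags from TM_PYTHON_VERSIONS, and source-vs-binary PyTorch from TORCH_BINARY, so switching images is just an env-var change with no Dockerfile in the repo.)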

@powderluv (Collaborator)

Probably easier to just explain it in code: #1234. It provides a single entry point for end users to replicate the CI / Release builds.

It allows the same workflow to run any docker image you want -- building all CI and Release builds in one go. You can also pick the Python versions, and whether you want PyTorch in source or binary form.

It only needs the actual CMake commands filled in.

@sjain-stanford I didn't want to hijack your PR, so please take over any / all parts of #1234.

@asaadaldien (Member)

TM_PYTHON_VERSIONS=cp310-cp310 MANYLINUX_DOCKER_IMAGE=nvidia/cuda:10.2-cudnn8-devel-ubuntu18.04 TORCH_BINARY=OFF PACKAGES="torch-mlir OOT-CI" ./build_tools/python_deploy/build_linux_packages.sh

Thanks for explaining, @powderluv. To test my understanding: we want CI to run build_linux_packages.sh from #1234 to unify the CI & release flows, and to parameterize it such that the CI buildAndTest.yaml pulls a specific docker image while downstream users are free to use whatever docker image they want. Is that correct?

@sjain-stanford (Member, Author) commented Aug 16, 2022

So a dev doesn't have to do:

  • Their normal build
  • Their CI docker flow
  • Their release docker flow

Correct me if I'm wrong, but we wouldn't need to keep (1) and (2) split anymore, as the docker flow would eliminate the need for (1). CI is transparent to developers, and they only need to care about:

  1. Developer build (mirrored in CI)
  2. Release flow

I am hesitant to combine the developer flow with the release build flow because they address different requirements:
One is fast and heavily cached, for quick signals from runs repeated many times an hour; the other is slow, doesn't / shouldn't use a cache, and is used for once-a-day sort of releases.

@sjain-stanford (Member, Author) commented Aug 16, 2022

Don't add yet another dev workflow. We have the native CMake flows (OOT, in-tree), setup.py, and then the Release build docker flow. I would suggest we use the Release docker flow instead of yet another flow.

The docker flow supports both native CMake flows (OOT, in-tree). It just runs them in a container rather than natively (which is more robust and guarantees a clean CI run if the local run works, so devs don't need to run CI separately). So with this PR we'd be reducing the number of dev workflows from 3 to 2:

  • Native CMake flow (both OOT and in-tree supported) -> what devs test locally is guaranteed to be mirrored in CI
  • Release build flow (uses setup.py for CMake) -> this specifically tests the python wheel interface

@powderluv (Collaborator)

So a dev doesn't have to do:

  • Their normal build
  • Their CI docker flow
  • Their release docker flow

Correct me if I'm wrong, but we wouldn't need to keep (1) and (2) split anymore, as the docker flow would eliminate the need for (1). CI is transparent to developers, and they only need to care about:

  1. Developer build (mirrored in CI)

  2. Release flow

I am hesitant to combine the developer flow with the release build flow because they address different requirements:

One is fast and heavily cached, for quick signals from runs repeated many times an hour; the other is slow, doesn't / shouldn't use a cache, and is used for once-a-day sort of releases.

Having learned from the libtorch changes, I won't change current developer behavior.

So it becomes:

  • Lightweight developer flow (cmake, bazel, OOT, etc.)
  • Local CI / Release re-creation to check PRs before submitting.
  • CI and Release on GHA

The second and third use the same scripts. We can try to bridge the gap between 1 and 2 if 2 is slower (with caching etc.).

@sjain-stanford (Member, Author) commented Aug 16, 2022

Having learned from the libtorch changes, I won't change current developer behavior.

So it becomes:

  • Lightweight developer flow (cmake, bazel, OOT, etc.)
  • Local CI / Release re-creation to check PRs before submitting.
  • CI and Release on GHA

I'm surely missing some details here. Could you please elaborate on what aspect of libtorch requires us to keep a "separate" lightweight developer flow that doesn't run inside docker?

Is it the pytorch binary support? Would it help if we supported installing pytorch from source inside the docker container?

@sjain-stanford (Member, Author) commented Aug 16, 2022

We already have a way to do this for release builds, so it would look something like this (whatever we want the env vars to be):
TM_PYTHON_VERSIONS=cp310-cp310 MANYLINUX_DOCKER_IMAGE=nvidia/cuda:10.2-cudnn8-devel-ubuntu18.04 TORCH_BINARY=OFF PACKAGES="torch-mlir OOT-CI" ./build_tools/python_deploy/build_linux_packages.sh

The way build_linux_packages.sh is set up, it just spawns a container to run a build, generate artifacts, and exit. This is perfect for releases (no human involvement). What we're looking to support with the docker flow is a local reproducer, which lets users stay inside the container (interactively) and debug any failures.

Here the first command launches a container interactively, from which the next three commands are run. So users get the chance to make incremental fixes and re-test without launching another container.

# Build an image and launch an interactive docker container
./build_tools/docker/run_docker.sh

# Run cmake build (either in-tree or out-of-tree)
./build_tools/docker/run_cmake_build{_oot}.sh

# Run torch-mlir unit tests (+ python + dialect LIT tests)
./build_tools/docker/run_unit_tests.sh

# Run torch-mlir integration tests
./build_tools/docker/run_integration_tests.sh
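
Roughly, the distinction is (a sketch with placeholder image names):

# Release-style one-shot: the container runs the build, emits artifacts, exits
docker run --rm <manylinux-image> ./build_script.sh

# This PR's reproducer: an interactive shell that persists across the
# build/test commands above, so failures can be debugged in place
docker run -it <dev-image> /bin/bash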

@sjain-stanford (Member, Author) commented Aug 16, 2022

I think I better understand why you'd want to keep the native developer workflow alive in conjunction with the docker workflow - e.g. for macOS (which doesn't natively support docker in GHA).

However, I'm missing how this adds "one other workflow users need to worry about".

Currently, any CMake change users make requires changing (1) the local CMake configuration for native builds, (2) the GHA CMake config for CI, and (3) the release script's setup.py-based CMake config.

With this change, we still have the same number of flows, right? The only thing this PR changes is (2): instead of blindly updating the raw GHA workflow (which is difficult to validate locally and requires pushing), users can now update run_cmake_build.sh and validate it locally before pushing. This saves scarce CI resources and encourages users to validate their patches locally before sending PRs upstream. I see this as a win-win and a net improvement over what we have currently, with no regressions / tradeoffs.
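
Concretely, the CI job then reduces to running the same scripts a developer runs locally, along these lines (a hypothetical shape of the workflow step, using the scripts from this PR):

./build_tools/docker/run_docker.sh
./build_tools/docker/run_cmake_build.sh
./build_tools/docker/run_unit_tests.sh
./build_tools/docker/run_integration_tests.sh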

Regarding unifying this docker workflow with the release docker workflow: that seems like a good incremental next step, but it isn't required for this PR to be considered.

@sjain-stanford (Member, Author) commented Aug 16, 2022

I'm also happy to close this one if you'd like to continue bringing #1234 to completion, @powderluv (including the CMake config, ccache, pip cache, and running unit and integration tests). Anything that helps us validate the CMake CI locally seems like a net improvement, and we'd benefit from it; it doesn't have to be done this way. We (Cruise) are mainly concerned with the bazel builds, so I can send out a separate PR to dockerize just the bazel GHA build and extend it to include unit and integration tests, which should cover all bases for us.

@powderluv (Collaborator)

OK, I'll take over and complete #1234. Will get to it later this week.

@sjain-stanford (Member, Author)

Thank you @powderluv. This is very much appreciated. Looking forward to it.
